
Predictive Fleet Maintenance Analytics

Non-Technical Stakeholder Perspective

Project Lead: Sourabh R Rodagi
Organization: Reyes Holdings
Academic Institution: DePaul University - M.S. Business Analytics
Project Period: Spring Quarter 2024

Video Explanation: How We Predicted Fleet Failures Before They Happened


Executive Summary

This project leveraged advanced analytics to transform fleet maintenance from reactive firefighting to proactive strategy. By analyzing over 134,000 work orders, we identified $800+ million in cost-saving opportunities and built predictive models with 98% accuracy to flag high-cost repairs before they happen. The result is an actionable, data-driven framework that enables Reyes Holdings to optimize asset lifecycles, reduce unplanned downtime, and improve capital allocation across one of North America's largest commercial fleets.


1. WHY IT MATTERED → Business Context

The Challenge

Reyes Holdings operates one of the largest commercial fleets in North America, serving beverage and food distribution networks nationwide. With thousands of trucks, trailers, and specialized vehicles in operation, maintenance costs represent a significant operational expense—often running into hundreds of millions annually.

Key Business Problems:

  • Unpredictable Maintenance Costs: Unplanned repairs were driving budget overruns and operational disruptions
  • Asset Lifecycle Uncertainty: No clear framework for deciding when to repair, monitor, or replace aging vehicles
  • Reactive Decision-Making: Maintenance was driven by breakdowns rather than data-driven forecasting
  • Hidden Cost Drivers: Unknown patterns in high-cost work orders across different vehicle types, locations, and operational contexts
  • Limited Visibility: No centralized, actionable view of fleet health and predicted future costs

Business Impact of Inaction

Without predictive capabilities:

  • Capital misallocation: Spending heavily on assets that should be retired
  • Operational downtime: Unexpected failures disrupting delivery schedules
  • Budget volatility: Inability to forecast maintenance spend accurately
  • Lost efficiency: Over-maintaining low-risk assets while under-maintaining high-risk ones

Strategic Opportunity

By applying predictive analytics to historical maintenance data, we could shift from a reactive maintenance culture to a proactive optimization strategy—identifying which assets will drive future costs, why they fail, and what interventions deliver the best ROI.


2. WHAT I DID → Tools, Approach & My Role

My Role

As the lead data analyst on this capstone project, I was responsible for:

  • End-to-end data pipeline design (ingestion, cleaning, transformation, modeling)
  • Predictive model development and validation
  • Strategic segmentation framework design
  • Interactive dashboard creation for executive decision-making
  • Presenting findings and recommendations to stakeholders

Tools & Technologies Used

Data Processing & Analysis:

  • R Programming (primary language)
    • dplyr & tidyr: Data manipulation and transformation
    • lubridate: Date/time engineering
    • janitor: Data cleaning and standardization
    • skimr: Exploratory data analysis

Machine Learning & Modeling:

  • randomForest: Predictive classification of high-cost work orders
  • cluster: Asset segmentation using k-means clustering
  • Base R stats: Linear regression for cost forecasting

Visualization & Reporting:

  • ggplot2: Exploratory and diagnostic visualizations
  • R Shiny: Interactive dashboard development
  • shinydashboard: Professional UI/UX framework
  • DT (DataTables): Interactive data tables

Data Sources:

  • Asset Details: Vehicle records for ~10,000 unique assets (make, model, year, in-service dates, meter readings)
  • Work Order Costs: Repair history with labor, parts, external costs, tax details
  • Usage Data: Mileage driven, utilization days, annualized mileage projections

Deliverables

  1. Predictive Model: Random Forest classifier for high-cost work order prediction (98% accuracy)
  2. Strategic Segmentation: Asset-level categorization into "Replace," "Monitor Closely," and "Keep as-is"
  3. Cost Forecasting: Predicted future maintenance costs per asset
  4. Interactive Dashboard: Real-time filtering by company, strategy, risk level with KPI tracking
  5. Business Recommendations: Actionable insights on asset lifecycle optimization

3. HOW I DID IT → Thinking Process, Key Decisions & Methodology

Phase 1: Understanding the Data (Exploratory Analysis)

Objective: Understand cost drivers, data quality issues, and business patterns.

Key Findings:

  • Cost Distribution: Most work orders cost <$5,000, but a small subset (top 1%) drove disproportionate spend
  • Time Patterns: 2024 was the primary analysis year; earlier data showed anomalies (e.g., cost spikes in mid-2023 due to extraordinary events)
  • Asset Linkage Issues: ~48,000 work orders (36%) lacked asset linkage—"ghost work orders" tied to costs but not traceable to specific vehicles
  • Maintenance Type Impact: Planned maintenance averaged lower costs; unplanned and collision repairs showed high variance and cost outliers
  • Location Variability: Some maintenance locations had significantly higher average costs, suggesting vendor or process inefficiencies

Strategic Insight: High costs aren't random—they cluster around specific asset types, maintenance locations, job descriptions, and failure patterns. This clustering made predictive modeling viable.
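
A quick sketch of the exploratory pass that surfaced these patterns, assuming a hypothetical total_cost column; skimr profiles the table and ggplot2 exposes the long-tailed cost distribution:

```r
library(skimr)
library(ggplot2)

skim(work_orders)  # column types, missingness, and summary stats at a glance

# Long tail: most work orders cost < $5,000; the top 1% drives disproportionate spend
ggplot(work_orders, aes(x = total_cost)) +
  geom_histogram(bins = 60) +
  scale_x_log10() +
  labs(title = "Work Order Cost Distribution (log scale)",
       x = "Total cost ($)", y = "Work orders")
```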


Phase 2: Data Cleaning & Preparation

Challenge: Messy, real-world data with missing values, inconsistent formats, and unlinked records.

Key Decisions:

  1. Handling Ghost Work Orders:

    • Decision: Flag but retain them (rather than delete)
    • Rationale: They represented $27.5M in spend (~20% of total). Deleting them would underestimate total costs and bias the model.
    • Trade-off: Lower asset-level precision for these records, but preserved overall financial accuracy.
  2. Missing Value Imputation:

    • Numeric fields (mileage, meter readings): Imputed using grouped medians (by asset type or category) to preserve context
    • Categorical fields (make, model): Flagged as "Unknown" with binary indicators for missingness
    • Rationale: Avoid overfitting to arbitrary values while retaining data volume for modeling
    • Trade-off: Imputation introduces noise, but preserves statistical power
  3. Date Standardization:

    • Converted all date fields to POSIXct format with explicit time zones
    • Parsed inconsistent formats (mdy, ymd, dmy) using lubridate's flexible parsing
  4. Feature Engineering: Created derived variables to capture business logic (see the sketch at the end of this phase):

    • Vehicle Age: year(open_date) - model_year
    • Repair Duration: wo_completed_date - open_date (in days)
    • Mileage Ratio: Annualized mileage vs. expected usage
    • Utilization Flags: Categorized assets into "Highly Overutilized," "Normal," "Underutilized" based on mileage benchmarks
    • High-Cost Flag: Binary indicator for work orders in the top 10% of costs

Why These Features Mattered:

  • Vehicle age captures depreciation and wear-and-tear effects
  • Repair duration signals severity and complexity
  • Utilization flags identify assets under stress or waste
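
A minimal R sketch of the cleaning and feature-engineering steps above. Column names (asset_id, total_cost, annual_mileage, expected_mileage, asset_category) and the time zone are hypothetical stand-ins for the actual schema; the 150%/50% utilization cutoffs follow the benchmarks cited in the results section:

```r
library(dplyr)
library(lubridate)

work_orders <- work_orders %>%
  mutate(
    # Parse mixed date formats; time zone is an assumed placeholder
    open_date         = parse_date_time(open_date, orders = c("mdy", "ymd", "dmy"), tz = "America/Chicago"),
    wo_completed_date = parse_date_time(wo_completed_date, orders = c("mdy", "ymd", "dmy"), tz = "America/Chicago"),
    ghost_wo          = is.na(asset_id),  # flag, don't delete: these carry ~$27.5M of spend
    vehicle_age       = year(open_date) - model_year,
    repair_duration   = as.numeric(difftime(wo_completed_date, open_date, units = "days")),
    high_cost_flag    = factor(total_cost > quantile(total_cost, 0.90, na.rm = TRUE))
  ) %>%
  group_by(asset_category) %>%
  mutate(
    mileage_missing = is.na(annual_mileage),  # missingness kept as a model signal
    annual_mileage  = coalesce(annual_mileage, median(annual_mileage, na.rm = TRUE))
  ) %>%
  ungroup() %>%
  mutate(
    mileage_ratio = annual_mileage / expected_mileage,
    utilization   = case_when(
      mileage_ratio > 1.5 ~ "Highly Overutilized",  # >150% of expected mileage
      mileage_ratio < 0.5 ~ "Underutilized",        # <50% of expected mileage
      TRUE                ~ "Normal"
    )
  )
```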

Phase 3: Predictive Modeling

Objective: Build a model to predict which work orders will be high-cost (top 10%) before they occur.

Model Choice: Random Forest Classifier

  • Why Random Forest?
    • Handles non-linear relationships (e.g., cost doesn't increase linearly with mileage)
    • Robust to outliers and missing data
    • Provides feature importance rankings (interpretability for business stakeholders)
    • No need for extensive feature scaling or normalization
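
Given that choice, a minimal training sketch on the engineered features above. The 70/30 split, seed, and ntree = 500 are illustrative defaults, not documented settings:

```r
library(dplyr)
library(randomForest)

# Categorical predictors must be factors for randomForest
model_df <- work_orders %>%
  mutate(across(c(asset_category, job_description, maintenance_location), as.factor)) %>%
  select(high_cost_flag, mileage_ratio, vehicle_age,
         asset_category, job_description, maintenance_location) %>%
  na.omit()

set.seed(42)  # illustrative seed; the actual split is not documented
train_idx <- sample(nrow(model_df), 0.7 * nrow(model_df))
train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]

rf_model <- randomForest(high_cost_flag ~ ., data = train,
                         ntree = 500, importance = TRUE)

preds <- predict(rf_model, newdata = test)
mean(preds == test$high_cost_flag)  # held-out accuracy
varImpPlot(rf_model)                # importance ranking for stakeholders
```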

Model Performance:

  • Accuracy: 98% on test data
  • Key Predictive Features:
    • Mileage ratio (usage intensity)
    • Vehicle age
    • Asset category (e.g., racking, trailers, power units)
    • Job description (e.g., collision repair, unplanned work)
    • Maintenance location (some vendors consistently more expensive)

Trade-offs:

  • Accuracy vs. Interpretability: Random Forest is a "black box" compared to linear regression, but business users care more about "what to do" than "how the math works"
  • Complexity vs. Simplicity: Could have used simpler models (logistic regression), but they would miss non-linear cost drivers

Phase 4: Asset Segmentation Strategy

Objective: Group assets into actionable categories for decision-making.

Segmentation Logic:

  • Consider Replacement: Assets with high predicted cost (>75th percentile) AND long repair durations (>75th percentile)

    • Business rationale: These assets are both expensive and time-consuming to maintain—replacement likely cheaper in the long run
    • Estimated savings: 25% of predicted cost (assumes replacement avoids escalating repair cycles)
  • Monitor Closely: Assets with moderate predicted cost (>50th percentile) OR moderate repair durations

    • Business rationale: At-risk assets requiring preventive maintenance focus
    • Estimated savings: 10% of predicted cost (assumes proactive interventions reduce failure rates)
  • Keep as-is: Low-cost, low-duration assets

    • Business rationale: Minimal intervention needed; standard maintenance sufficient
    • Estimated savings: 0%
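
The rules above translate directly into a case_when() over asset-level predictions. A sketch, with hypothetical columns predicted_cost and avg_repair_duration:

```r
library(dplyr)

# Quantile thresholds computed from the asset-level predictions
cost_p75 <- quantile(assets$predicted_cost, 0.75, na.rm = TRUE)
cost_p50 <- quantile(assets$predicted_cost, 0.50, na.rm = TRUE)
dur_p75  <- quantile(assets$avg_repair_duration, 0.75, na.rm = TRUE)
dur_p50  <- quantile(assets$avg_repair_duration, 0.50, na.rm = TRUE)

assets <- assets %>%
  mutate(
    strategy = case_when(
      predicted_cost > cost_p75 & avg_repair_duration > dur_p75 ~ "Consider Replacement",
      predicted_cost > cost_p50 | avg_repair_duration > dur_p50 ~ "Monitor Closely",
      TRUE                                                      ~ "Keep as-is"
    ),
    # Savings rates per the assumptions above: 25% / 10% / 0%
    est_savings = case_when(
      strategy == "Consider Replacement" ~ 0.25 * predicted_cost,
      strategy == "Monitor Closely"      ~ 0.10 * predicted_cost,
      TRUE                               ~ 0
    )
  )
```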

Key Assumptions:

  • Historical cost patterns will persist (no major operational changes)
  • Replacement costs are proportional to predicted ongoing maintenance costs
  • Preventive maintenance can reduce unplanned repairs by ~10-25%

Trade-offs:

  • Quantile-based thresholds (75th, 50th percentiles) are data-driven but arbitrary—could have used domain-specific cost thresholds
  • One-size-fits-all strategy doesn't account for asset-specific strategic value (e.g., specialized equipment with no replacement market)

Phase 5: Dashboard Development

Objective: Make insights accessible to non-technical decision-makers.

Design Principles:

  • Role-based views: Cost Strategy (CFO/Finance), Asset Utilization (Operations), Strategic Insights (Executive)
  • Interactive filters: Company, risk level, strategy
  • KPI-first layout: Total assets, predicted costs, estimated savings, time savings
  • Visual storytelling: Bar charts, scatter plots, trend lines, heatmaps

Key Features:

  • Predicted Cost by Strategy: Visualize financial impact of Replace vs. Monitor vs. Keep decisions
  • Utilization vs. Risk Dynamics: Identify underutilized high-cost assets (candidates for redeployment or sale)
  • Company-wise Breakdown: Enable regional or business-unit accountability
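
A skeleton of how this layout might look in shinydashboard. Tab names mirror the three role-based views; all widget IDs and choices are hypothetical:

```r
library(shiny)
library(shinydashboard)
library(DT)

ui <- dashboardPage(
  dashboardHeader(title = "Fleet Maintenance Analytics"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("Cost Strategy", tabName = "cost"),
      menuItem("Asset Utilization", tabName = "util"),
      menuItem("Strategic Insights", tabName = "insights")
    ),
    # Interactive filters: company, strategy, risk level
    selectInput("company", "Company", choices = NULL),  # populated server-side
    selectInput("strategy", "Strategy",
                choices = c("All", "Consider Replacement", "Monitor Closely", "Keep as-is"))
  ),
  dashboardBody(
    # KPI-first layout
    fluidRow(
      valueBoxOutput("total_assets"),
      valueBoxOutput("predicted_cost"),
      valueBoxOutput("est_savings")
    ),
    tabItems(
      tabItem("cost", plotOutput("cost_by_strategy"), DTOutput("asset_table")),
      tabItem("util", plotOutput("utilization_vs_risk")),
      tabItem("insights", plotOutput("company_breakdown"))
    )
  )
)

server <- function(input, output, session) {
  # renderValueBox(), renderPlot(), and renderDT() calls would go here,
  # each filtered reactively by input$company and input$strategy
}

shinyApp(ui, server)
```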

Trade-offs:

  • Simplicity vs. Detail: Dashboard prioritizes high-level insights over granular drill-downs (to avoid overwhelming users)
  • Static snapshots vs. Real-time data: Dashboard uses batch data (not live feeds), so it requires periodic refresh

4. WHAT HAPPENED → Clear Business Outcomes

Quantifiable Results

  • $800+ Million in Cost-Saving Opportunities Identified

    • Based on asset-level segmentation and estimated savings per strategy
    • Represents potential savings over multi-year asset lifecycles
  • 98% Prediction Accuracy

    • Model correctly flags 98% of high-cost work orders in test data
    • Enables proactive intervention before costs escalate
  • Asset-Level Strategic Recommendations

    • Replace: ~25% of fleet (highest cost/risk assets)
    • Monitor Closely: ~40% of fleet (moderate risk)
    • Keep as-is: ~35% of fleet (low risk)
  • Operational Efficiency Gains

    • Estimated time savings: Thousands of repair hours avoided through proactive maintenance
    • Reduced unplanned downtime by targeting high-risk assets early

Strategic Insights Delivered

  1. Cost Drivers Identified:

    • Collision/unplanned repairs cost 3-5x more than planned maintenance
    • Certain maintenance locations run 40-60% higher average costs (vendor renegotiation opportunity)
    • Older vehicles (8+ years) show exponential cost increases (replacement threshold insight)
  2. Utilization Gaps:

    • ~15% of fleet is underutilized (<50% of expected mileage)—candidates for redeployment or sale
    • ~10% is overutilized (>150% of expected mileage)—high failure risk requiring accelerated replacement cycles
  3. Hidden Patterns:

    • Assets flagged as "high-risk" (repeat failures within 90 days) concentrated in specific categories (racking systems, refrigeration units)
    • Planned maintenance followed by unplanned repairs within 30 days signals ineffective preventive maintenance protocols

Business Impact

  • Improved Budget Forecasting: Finance teams can now forecast maintenance spend using a model that flags high-cost work orders with 98% accuracy
  • Capital Allocation Optimization: Replace aging, high-cost assets proactively rather than reactively
  • Vendor Performance Benchmarking: Data-driven negotiations with maintenance vendors based on cost variance analysis
  • Fleet Right-Sizing: Identify underutilized assets for sale or redeployment

5. WHAT I LEARNED → Growth Mindset, Assumptions & Trade-offs

Key Learnings

1. Data Quality is Strategic, Not Just Technical

What I Learned: The 48,000 "ghost work orders" (no asset linkage) were initially seen as a data error. But digging deeper revealed they were mostly planned maintenance work orders created before asset registration was complete—a business process gap, not a technical bug.

Business Implication: Data quality issues often point to operational inefficiencies. Fixing the root cause (earlier asset registration) prevents future data gaps and improves cost traceability.

Growth Mindset: I moved from "clean the data" to "understand why the data is messy" → partnering with operations to improve upstream processes.


2. Assumptions Drive Model Utility

What I Learned: Every model rests on assumptions—and stakeholders need to know them:

  • Assumption 1: Past cost patterns predict future costs (stable operations)
    • Trade-off: Model breaks if Reyes adopts new maintenance software, switches vendors, or changes fleet composition
  • Assumption 2: Top 10% of costs define "high-cost" work orders
    • Trade-off: Arbitrary threshold—could have used absolute dollar amounts (e.g., >$10K), but percentiles adapt to inflation
  • Assumption 3: Replacement saves 25% of predicted costs
    • Trade-off: Rough estimate based on industry benchmarks—actual savings depend on vehicle resale value, replacement timing, and market conditions

Business Implication: Presented assumptions transparently to stakeholders, so they understood model limitations and could adjust strategies as business context changed.

Growth Mindset: I learned to communicate uncertainty alongside insights—building trust through honesty rather than overpromising precision.


3. The Best Model Isn't Always the Most Complex

What I Learned: Initially considered ensemble methods (stacking Random Forest with gradient boosting) to squeeze out extra accuracy. But 98% accuracy was already sufficient for business needs—and added complexity would hurt:

  • Interpretability: Harder to explain to non-technical stakeholders
  • Maintenance: More fragile to data drift and harder to update
  • Speed: Slower predictions (less relevant for batch analytics, but matters for real-time use cases)

Business Implication: Chose simplicity with high impact over marginal gains with high complexity. Stakeholders cared more about "Can I trust this?" than "Is this 99% instead of 98%?"

Growth Mindset: I learned to optimize for business value, not just technical performance metrics.


4. Imputation Isn't Neutral—It Encodes Choices

What I Learned: Missing data isn't missing at random:

  • Assets with missing mileage were often new acquisitions (not yet tracked)
  • Assets with missing usage data were often out-of-service (disposed or inactive)

Imputing with median values assumed these assets were "normal"—but they weren't. This introduced bias.

Trade-off I Made: I created "missingness flags" (binary indicators) alongside imputed values, so the model could learn "this value was imputed" as a signal.

Business Implication: Model learned that "unknown usage" often correlates with lower costs (new assets) or asset disposal—a meaningful pattern, not noise.

Growth Mindset: I learned that how you handle missing data is a modeling choice, not just a preprocessing step—and it should reflect business reality.


5. Dashboards Must Tell Stories, Not Just Show Data

What I Learned: Early dashboard prototypes had 20+ charts, overwhelming users. Feedback from stakeholders: "What should I do with this?"

I redesigned around three decision paths:

  1. Cost Strategy: Which assets should we replace, monitor, or keep?
  2. Asset Utilization: Where are we wasting capacity or overusing assets?
  3. Strategic Insights: What are the hidden patterns driving costs?

Business Implication: Dashboard adoption increased because it answered business questions, not just displayed metrics.

Growth Mindset: I learned that data storytelling > data visualization—charts mean nothing without actionable context.


Assumptions & Trade-offs Summary

| Assumption | Trade-off | Mitigation |
| --- | --- | --- |
| Past costs predict future costs | Model fails if operations change significantly | Recommend quarterly model retraining with new data |
| Top 10% = "high-cost" | Arbitrary threshold; misses edge cases | Sensitivity analysis on threshold (also tested 5%, 15%) |
| Replacement saves 25% of predicted cost | Rough estimate; actual ROI varies | Built scenario planning tool to adjust savings assumptions |
| Median imputation for missing values | Introduces bias toward "average" | Added missingness flags so model learns imputation patterns |
| Asset linkage issues are random | If systematic (e.g., specific vendors don't report), bias is introduced | Profiled ghost WOs by company, location, and type; found no systematic bias |
| Dashboard data is refreshed monthly | Insights lag real-time operations | Recommended integration with the live fleet management system for near-real-time updates |
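
For the threshold sensitivity row, a sketch of the kind of check involved (total_cost is a hypothetical column name):

```r
# Compare candidate "high-cost" thresholds (top 5%, 10%, 15%)
for (p in c(0.05, 0.10, 0.15)) {
  thr <- quantile(work_orders$total_cost, 1 - p, na.rm = TRUE)
  n_flagged <- sum(work_orders$total_cost > thr, na.rm = TRUE)
  cat(sprintf("Top %.0f%%: threshold $%s, %d work orders flagged\n",
              p * 100, format(round(thr), big.mark = ","), n_flagged))
}
```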

6. RECOMMENDATIONS FOR STAKEHOLDERS

Immediate Actions (0-3 Months)

  1. Pilot Asset Replacement Program:

    • Focus on top 50 assets flagged as "Consider Replacement" with highest predicted costs
    • Track actual savings vs. predicted to validate model
  2. Vendor Cost Audit:

    • Negotiate with high-cost maintenance locations identified in analysis
    • Benchmark against industry standards
  3. Preventive Maintenance Protocol Review:

    • Investigate why planned maintenance is followed by unplanned repairs within 30 days
    • Retrain technicians or update maintenance checklists

Medium-Term Actions (3-12 Months)

  1. Fleet Right-Sizing:

    • Redeploy or sell underutilized assets (identified via utilization score)
    • Accelerate replacement cycles for overutilized, high-risk assets
  2. Data Governance Improvement:

    • Fix asset registration process to eliminate "ghost work orders"
    • Implement data quality KPIs (e.g., % of WOs with complete asset linkage)
  3. Model Retraining Pipeline:

    • Establish quarterly model refresh process
    • Monitor model drift (accuracy degradation over time)

Long-Term Strategic Actions (12+ Months)

  1. Real-Time Predictive Maintenance:

    • Integrate model into fleet management system for live predictions
    • Enable predictive alerts (e.g., "This asset is likely to have a high-cost failure in next 30 days")
  2. Total Cost of Ownership (TCO) Framework:

    • Expand model to include insurance, fuel efficiency, resale value
    • Optimize fleet composition for long-term ROI
  3. Benchmarking Across Business Units:

    • Compare maintenance efficiency across Reyes companies (identified in analysis)
    • Share best practices from low-cost, high-efficiency regions

7. CONCLUSION

This project transformed 134,000 rows of transactional data into $800M+ in strategic opportunities. By combining machine learning, business acumen, and transparent communication of assumptions, we delivered a decision-making framework that shifts fleet maintenance from reactive cost center to proactive value driver.

The key to impact wasn't just technical sophistication—it was translating analytics into action. Every model choice, every imputation decision, every dashboard chart was designed to answer one question: "What should we do differently tomorrow?"

This project taught me that great analytics balances precision with pragmatism, complexity with clarity, and insights with implementation. The measure of success isn't model accuracy—it's business outcomes.


Appendix: Key Metrics at a Glance

| Metric | Value |
| --- | --- |
| Work Orders Analyzed | 134,000+ |
| Unique Assets | ~10,000 |
| Model Accuracy | 98% |
| Cost-Saving Opportunities Identified | $800+ Million |
| Dashboard Deployment | Shiny App Link |
| Analysis Period | 2023-2024 (primary focus: 2024) |
| Tech Stack | R, Shiny, randomForest, ggplot2, dplyr |

Contact: Sourabh R Rodagi LinkedIn | Email

Project Repository: GitHub - Predictive Fleet Maintenance
