Technical Paper Proposal: "Infrastructure Lifecycle" #759

Open
rynowak opened this issue Oct 15, 2024 · 8 comments

Comments

@rynowak
Collaborator

rynowak commented Oct 15, 2024

👋 WG-Infrastructure-Lifecycle dropping in to say hello 👋

TAG AppDelivery has adopted a new process for writing and publishing technical papers. This issue is our initial proposal for writing a paper as one of the working group's deliverables.

You can learn more about the working group here. We'd love to have you participate and contribute along with us!


Title: Infrastructure Lifecycle

Description:

As the cloud native approach matures, the workloads we run have increasingly complex infrastructure needs. While we all strive to control costs, enforce best practices, and ensure secure configurations, the reality is often fragmented.

Despite the complexity and sophistication required, not enough has been done to meet the challenges. We're seeing significant investment in new open source infrastructure projects both in and out of CNCF, but effective tooling for cloud native infrastructure lifecycle management remains elusive. The Platform Engineering movement emphasizes treating infrastructure as a product, but there's no standardized approach for managing its lifecycle.

While savvy users are embracing cloud native practices, infrastructure requirements are inherently diverse. We see an opportunity to champion technology-agnostic best practices. Infrastructure lifecycle management deserves the same level of attention and planning we dedicate to established areas of cloud native development. This ensures security, resilience, manageability, sustainability, and observability.

Audience:

Any end-user involved in or responsible for the management of cloud-native infrastructure - regardless of job title, workload, or chosen technologies.

Impact:

The whitepaper will guide end-users in managing infrastructure to ensure it is secure, resilient, manageable, sustainable, and observable. Any end-user, regardless of their role or technology choices, can leverage the whitepaper’s guidance to implement a mature and stable infrastructure management practice.

Scope:

The scope of this whitepaper covers a set of recommended practices and maturity guidelines that are generally applicable and technology agnostic. There are many complex domain-specific and workload-specific areas of infrastructure management, and to serve the broadest possible audience we’d like to avoid going deep in any specific area.

Our initial proposed set of topics (non-exhaustive):

Configuration / Infrastructure as Code

  • Development processes
  • Design and abstractions (aka. componentization / modularization)
  • Testing

Deployment approaches (Infra as Data, etc.)

  • Delivery across environments, artifacts, versioning
  • Application-aligned (spec the infra for an application and deploy it) vs. shared infra (e.g. clusters), vs. horizontal / siloed

State management/backups

  • Disaster recovery
  • Availability (not just DR but also managing scaling, resilience)

Observability

  • KPIs/Metrics
  • Policy enforcement

This sounds like a lot, even for an initial list. It's a really big topic and we'd rather deliver a small amount of good content than a large amount of questionable content. We need to begin the process and see where the contributor/user interest is before refining this list further.

We've got a clear picture right now on some topics that definitely are out-of-scope:

  • Going deep on specific infrastructure-management technologies, e.g. a user guide for OpenTofu
  • Going deep into the management of specific infrastructure, e.g. how to fail over and back up PostgreSQL servers
  • Going deep into the management of specific workloads, e.g. comprehensive guidance for websites + global CDNs
  • Going deep into specific industry verticals, e.g. healthcare
  • Physical infrastructure, e.g. servers, racks, CPUs

Also, there are potentially many overlaps between other CNCF guidance and our proposed whitepaper. For example, many areas of the infrastructure lifecycle are security-critical. We plan to surface the existing guidance created by others where possible, and to collaborate with other TAGs and WGs.

@rynowak
Collaborator Author

rynowak commented Oct 15, 2024

Hi folks, we're trying to assess interest in the topic and find contributors who want to work on this paper with us. We plan to present this at the TAG App-Delivery general meeting on 10/16 (schedule willing).

The best way to get involved is to comment on this issue, reach out to us on CNCF Slack in the #wg-infrastructure-lifecycle channel, or attend a working group meeting.

You can learn more about the working group here. We'd love to have you participate and contribute along with us!

@kief

kief commented Oct 18, 2024

Great! I've been coming to the calls and am looking forward to helping out with the paper.

@bschaatsbergen
Collaborator

Hey @rynowak, thanks for delivering this outline. IIRC, @elft3r is also working on an outline—maybe we can merge the two and use that as a base to draft an RFC?

@rynowak
Collaborator Author

rynowak commented Dec 3, 2024

That sounds great! I'll follow up with @elft3r

@elft3r
Collaborator

elft3r commented Dec 24, 2024

Hey everyone! I took a shot at putting together an outline. The version below is a mix of what's in the description, our discussions, meeting notes, and my original ideas. I’d love to hear your thoughts on it so we can tweak it together.

  • Introduction
    • Importance of lifecycle management in cloud-native environments
    • Relationship to cloud-native applications
  • Infrastructure in this context?
    • Definition of cloud-native infrastructure
    • Differentiation from traditional infrastructure
    • Distinction between infrastructure and platform
  • Infrastructure as Code (IaC)
    • Design and abstractions (componentization / modularization)
    • Development process and best practices
    • Testing strategies (unit, integration, compliance)
    • Language attributes (general-purpose vs. DSL, imperative vs. declarative)
    • Version control and collaboration
  • Deployment Approaches
    • Infrastructure as Data
    • GitOps for infrastructure
    • Continuous Integration/Continuous Deployment (CI/CD) for infrastructure
    • Application-aligned vs. shared infrastructure vs. horizontal/siloed approaches
    • Multi-cloud and hybrid cloud considerations
  • State Management
    • Desired state vs. current state reconciliation
    • Drift detection and remediation
    • Disaster recovery strategies
    • High availability, scaling, and resilience
  • Observability and Monitoring
    • KPIs/Metrics for infrastructure health
    • Logging and tracing
    • Policy enforcement and compliance monitoring
    • Security monitoring and threat detection
  • Lifecycle Stages
    • Provisioning and deployment
    • Configuration management
    • Updates and patching
    • Decommissioning and resource reclamation
  • Future Trends and Outlook

@bschaatsbergen
Collaborator

bschaatsbergen commented Jan 8, 2025

Thank you for putting in the work here @elft3r, @rynowak—I’ve reviewed both proposed outlines, combined them, and added a few elements of my own. I’ve also revised and rephrased certain topics and reordered sections to establish a stronger foundational context before diving into the details. See the revised outline below.

  1. Introduction
    • Importance of lifecycle management in cloud-native environments
    • Relationship to cloud-native applications
  2. Cloud-Native Infrastructure
    • Definition of cloud-native infrastructure
    • Differentiation from traditional infrastructure
    • Distinction between infrastructure and platform
  3. Infrastructure Lifecycle Stages
    • Infrastructure provisioning and deployment
    • Configuration management and automation
    • Managing updates and patching processes
    • Decommissioning resources and reclamation strategies
  4. Infrastructure as Code (IaC)
    • Language paradigms: general-purpose vs. DSL, imperative vs. declarative
    • Development workflows and best practices
    • Designing abstractions: componentization and modularization
    • Version control and collaborative workflows
    • Testing strategies: unit, integration, compliance, etc.
  5. State Management
    • Reconciliation of desired state vs. current state
    • Drift detection and automated remediation
    • Disaster recovery strategies
    • Ensuring high availability, scalability, and resilience
  6. Deployment Approaches
    • GitOps for infrastructure
    • Continuous Integration/Continuous Deployment (CI/CD) for infrastructure
    • Considerations for multi-cloud and hybrid cloud deployments
    • Application-aligned vs. shared infrastructure vs. horizontal/siloed approaches
  7. Observability and Monitoring
    • Key performance indicators (KPIs) and metrics for infrastructure health
    • Comprehensive logging and distributed tracing
    • Policy enforcement and compliance monitoring
    • Security monitoring and threat detection
  8. Future Trends and Outlook
    • Emerging patterns in cloud-native infrastructure management
    • Innovations in platform engineering and automation
    • Predictions for next-generation infrastructure tooling

@TBeijen

TBeijen commented Jan 9, 2025

Looks good. Some small things that spring to mind:

  • Disconnect between what's provisioned and resulting resources (think: autoscaling group resulting in ec2s, k8s operators). Maybe it needs a special mention, maybe part of 'lifecycle'.
  • Repeatability (probably a best practices subtopic).

bschaatsbergen changed the title from Technical Paper Proposal: Cloud Native Infrastructure-Lifecycle to Technical Paper Proposal: "Infrastructure Lifecycle" on Jan 9, 2025

@abangser
Collaborator

Thanks for sharing @bschaatsbergen! I know I haven't been around much, so please take this observation with that in mind as it may be unfounded and/or not relevant to you all.

In reviewing, I think these topics are all great and actually mirror a lot of the chapters in @kief's book, though definitely with a different focus.

So my concern is not with the topics, but with understanding why we believe the CNCF WG is uniquely positioned to add to the discourse. I know that the new guidelines/support for writing tech papers only came out recently but I think it has some interesting points. Namely target values like "Papers that serve as ecosystem guides inform cloud native projects and community members on a particular topic" and "These papers may establish decision or technology frameworks that allow adopters to reason about cloud native and open source projects to achieve a particular use case".

I for one have had to recenter myself on the fact that TAG App Delivery and its associated WGs have been more focused on the wider use of tech and higher-level concepts, and not enough on the other side of the equation: the projects, and how to identify gaps and guide improvements that align with these industry trends. In reviewing the objectives for TAGs, it is clear we have drifted from that project/ecosystem focus.

I am only raising this because I am having the same thoughts and chats about some of the ongoing work in the Platforms WG, and I figure that if you can catch this at the early stage of this white paper, you may face less rework and fewer challenges.
