airflow


Tutorial playlist:

https://www.youtube.com/watch?v=K9AnJ9_ZAXE&list=PLwFJcsJ61oujAqYpMp1kdUBcPG0sE0QMT

  1. What is Airflow and why do we need it?
    • Airflow is a workflow orchestration platform that lets users programmatically create, schedule, and monitor workflows. It is commonly used to automate machine learning tasks and to build complex data pipelines. A minimal example is sketched below.
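
For example, a workflow is just a Python file. The minimal sketch below (hypothetical DAG and task names) declares a one-task pipeline that Airflow will schedule daily and display in its UI for monitoring:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("hello from Airflow")

# The whole workflow is code: Airflow schedules it daily,
# records every run, and shows its status in the web UI.
with DAG("hello_workflow", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)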

Here's a well-structured guide for setting up Apache Airflow in a virtual environment, using the current working directory (.) as AIRFLOW_HOME.

The commands that should go in README.md are highlighted for easy reference.


🚀 Apache Airflow Local Setup Guide (Using Virtual Environment & Local Directory)

📌 Overview

This guide covers:
✅ Installing Airflow inside a Python virtual environment
✅ Using the current directory (.) as AIRFLOW_HOME
✅ Running Airflow webserver and scheduler
✅ Managing DAGs and users


🛠️ Prerequisites

Ensure you have:

  • Python 3.11 installed
  • pip, venv, and other required system packages
  • Enough disk space and proper permissions

1️⃣ Install Dependencies (Before Setup)

For Ubuntu/Debian:

sudo apt update
sudo apt install python3.11 python3.11-venv python3.11-distutils python3-pip -y

For macOS (Using Homebrew):

brew install [email protected]

For Windows:

  1. Download Python 3.11 from python.org.
  2. During installation, check the box: "Add Python to PATH".
  3. Open PowerShell and run:
    python -m ensurepip

2️⃣ Set Up Virtual Environment

🚀 (Add These Commands to README.md)

# Navigate to your project directory
cd ~/your-project-folder  # Change this to your actual folder

# Create and activate a virtual environment
python3.11 -m venv airflow-env
source airflow-env/bin/activate  # Linux/macOS
airflow-env\Scripts\activate     # Windows

# Verify Python version inside the virtual environment
python --version

3️⃣ Set AIRFLOW_HOME to Current Directory (.)

🚀 (Add These Commands to README.md)

# Set Airflow to use the current directory
export AIRFLOW_HOME=$(pwd)                 # Linux/macOS
set AIRFLOW_HOME=%cd%                      # Windows (cmd)
$env:AIRFLOW_HOME = (Get-Location).Path    # Windows (PowerShell)

# Make it persistent by appending it to your ~/.bashrc or ~/.zshrc
# (double quotes so the project path is expanded now, not re-evaluated at shell startup)
echo "export AIRFLOW_HOME=$(pwd)" >> ~/.bashrc
source ~/.bashrc

4️⃣ Install Apache Airflow

🚀 (Add These Commands to README.md)

pip install --upgrade pip
pip install apache-airflow==2.7.1
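# Optional but recommended in the Airflow docs: install against the official
# constraints file so dependency versions match the release (assumes Python 3.11)
# pip install "apache-airflow==2.7.1" \
#     --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.1/constraints-3.11.txt"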

Verify Installation

airflow version

5️⃣ Initialize Airflow Database

🚀 (Add These Commands to README.md)

airflow db init
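# Note: on Airflow 2.7+, "airflow db migrate" is the preferred command;
# "airflow db init" still works but is deprecated.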

This will create:

  • airflow.cfg → Airflow configuration file
  • airflow.db → SQLite database (for local use)

Check that the files were created in the current directory:

ls -l | grep airflow

6️⃣ Start Airflow Webserver and Scheduler

🚀 (Add These Commands to README.md)

# Start the Airflow web server (Runs on port 8080 by default)
airflow webserver --port 8080

# Open in browser: http://localhost:8080

# In a separate terminal, start the scheduler
airflow scheduler
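# Alternative for quick local testing: "airflow standalone" initializes the
# database, creates an admin user, and starts all components in one process.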

7️⃣ Create an Admin User

🚀 (Add These Commands to README.md)

airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email [email protected]
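# The command prompts for a password; add --password <value> to set it
# non-interactively (only sensible for local development).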

🛠 Now, log in to the Airflow UI at http://localhost:8080 with the admin credentials.


8️⃣ Add DAGs to dags/ Directory

🚀 (Add These Commands to README.md)

mkdir -p dags
  • Place your DAG Python files inside the dags/ directory.
  • Example DAG (dags/example_dag.py):
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator  # no-op placeholder task

    # One DAG run per day, starting from 2024-01-01
    with DAG("example_dag", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
        start = EmptyOperator(task_id="start")

Activate DAGs in UI:

  1. Start the scheduler:
    airflow scheduler
  2. Enable the DAG in the Airflow UI (http://localhost:8080).
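
To iterate on a DAG without waiting for the scheduler, Airflow 2.5+ also lets you trigger a single run straight from Python with dag.test(). A minimal sketch, assuming the example file above:

# appended at the bottom of dags/example_dag.py
if __name__ == "__main__":
    # runs one local DAG run; no scheduler or webserver required
    dag.test()

Run it with python dags/example_dag.py and watch the task logs in the terminal.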

9️⃣ Stop Airflow and Deactivate Virtual Environment

🚀 (Add These Commands to README.md)

# Stop Airflow (Find and kill processes)
pkill -f "airflow webserver"
pkill -f "airflow scheduler"

# Deactivate virtual environment
deactivate

🚀 Summary of Commands for README.md

# Install dependencies
sudo apt update
sudo apt install python3.11 python3.11-venv python3.11-distutils python3-pip -y

# Create and activate virtual environment
python3.11 -m venv airflow-env
source airflow-env/bin/activate  # (Linux/macOS)
airflow-env\Scripts\activate     # (Windows)

# Set AIRFLOW_HOME to current directory
export AIRFLOW_HOME=$(pwd)
echo "export AIRFLOW_HOME=$(pwd)" >> ~/.bashrc
source ~/.bashrc

# Install Apache Airflow
pip install --upgrade pip
pip install apache-airflow==2.7.1

# Initialize Airflow database
airflow db init

# Start Airflow webserver and scheduler (run in separate terminals)
airflow webserver --port 8080
airflow scheduler

# Create an admin user
airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email [email protected]

# Create DAGs directory
mkdir -p dags

# Stop Airflow and deactivate environment
pkill -f "airflow webserver"
pkill -f "airflow scheduler"
deactivate

A workflow is structured as:

WORKFLOW -> DAG (when to do what) -> TASK (what to do) -> OPERATOR (how to do it)
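
To make the mapping concrete, here is a hypothetical sketch (names are made up): the DAG decides when things run, each task is a named unit of work, and the operator class decides how that work gets done:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    print("transforming data")

# DAG      -> when to do what: run this pipeline daily, starting 2024-01-01
with DAG("etl_example", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    # TASK     -> what to do: each task_id names one step of the workflow
    # OPERATOR -> how to do it: BashOperator runs a shell command,
    #             PythonOperator calls a Python function
    extract = BashOperator(task_id="extract", bash_command="echo 'downloading data'")
    load = PythonOperator(task_id="transform_and_load", python_callable=transform)

    extract >> load  # extract runs first, then transform_and_load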

[Image: workflow overview]

[Image: DAG, tasks and operators]

[Image: DAG, task and operator internals]

[Image: task lifecycle]
