Building the Modern Data Stack

The key is to start simple and add complexity only when justified by actual needs, not anticipated ones.

Bidya Bhushan Bibhu

A Deep Dive into DuckDB-Powered Analytics

The Evolution of Data Architecture

Remember the days when setting up a data stack meant juggling multiple complex systems? I’ve been there. From managing temperamental Hadoop clusters to debugging ETL pipelines at 3 AM, I’ve experienced the full spectrum of data engineering challenges. But here’s the thing – the landscape is changing, and DuckDB is leading this transformation.

Why DuckDB is a Game-Changer in 2024

The Traditional Stack Problems

Before we dive in, let’s address the elephant in the room. The typical modern data stack looks something like this:

  • Fivetran/Airbyte for data ingestion ($$$)
  • Snowflake/BigQuery for storage and processing ($$$$)
  • dbt for transformations
  • Airflow for orchestration
  • Preset/Metabase for visualization

Total cost? Easily $50k+ annually for a medium-sized setup. Not to mention the operational complexity.

Enter DuckDB: The Swiss Army Knife of Analytics

What makes DuckDB special?

Here’s what I’ve discovered after using it in production for the last year:

  1. Performance That Rivals the Big Players
Sample Query
-- A query that processes 1 billion rows in seconds
SELECT 
    DATE_TRUNC('day', timestamp) as date,
    COUNT(*) as events,
    COUNT(DISTINCT user_id) as users
FROM events
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;
  2. Zero Configuration Required
    Unlike Postgres or MySQL, you don’t need to (see the minimal sketch after this list):
  • Configure memory settings
  • Set up replication
  • Manage vacuum processes
  • Deal with connection pooling
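
To make the point concrete, here is a minimal sketch of what zero configuration looks like in practice (the database file name is just an example): install the package and you immediately have a working analytical database, either in-memory or as a single file.

    import duckdb

    # One file on disk is the entire database; call duckdb.connect() with no
    # argument for a purely in-memory session instead.
    con = duckdb.connect("analytics.db")

    # No server, no users, no memory tuning, no vacuum scheduling:
    # you can create tables and run analytical SQL right away.
    con.execute("CREATE TABLE IF NOT EXISTS events (user_id VARCHAR, ts TIMESTAMP)")
    print(con.execute("SELECT COUNT(*) AS event_count FROM events").fetchdf())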

  3. Native Support for Modern Data Formats

    Sample Query
    -- Remote files need the httpfs extension
    INSTALL httpfs;
    LOAD httpfs;

    -- Direct querying of Parquet files from S3
    CREATE VIEW raw_events AS
    SELECT *
    FROM read_parquet('s3://my-bucket/*.parquet');
    
    -- CSV over HTTP, sampled widely and read as text columns
    SELECT *
    FROM read_csv_auto('https://data.source.com/stream',
        sample_size = 1000000,
        all_varchar = true);
    
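The same formats can be queried just as easily from Python, with results coming back as DataFrames; a small sketch, assuming a local directory of Parquet files at a hypothetical path:

    import duckdb

    # DuckDB treats files like tables: glob a directory of Parquet files and
    # aggregate them directly, returning the result as a pandas DataFrame.
    daily_counts = duckdb.sql("""
        SELECT CAST(timestamp AS DATE) AS date, COUNT(*) AS events
        FROM read_parquet('data/events/*.parquet')
        GROUP BY 1
        ORDER BY 1
    """).df()
    print(daily_counts.head())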

Building Your Analytics Platform

1. Data Ingestion: Beyond Basic ETL

Let’s build a robust ingestion pipeline:

    -- Create a staging area
    CREATE SCHEMA IF NOT EXISTS staging;
    
    -- Create the events table once, with the source schema but no rows
    CREATE TABLE IF NOT EXISTS staging.events AS
    SELECT * FROM read_parquet('s3://bucket/events/*/*.parquet') LIMIT 0;
    
    -- Set up an incremental loading pattern: append only today's
    -- events that haven't been loaded already
    INSERT INTO staging.events
    WITH new_data AS (
        SELECT *
        FROM read_parquet('s3://bucket/events/date=' || 
            CURRENT_DATE::VARCHAR || '/*.parquet')
    ),
    deduped AS (
        SELECT DISTINCT *
        FROM new_data
        WHERE event_id NOT IN (
            SELECT event_id 
            FROM staging.events
            WHERE CAST(timestamp AS DATE) = CURRENT_DATE
        )
    )
    SELECT * FROM deduped;
    
    -- Add quality checks
    CREATE OR REPLACE VIEW staging.data_quality AS
    SELECT 
        DATE_TRUNC('hour', timestamp) as check_hour,
        COUNT(*) as record_count,
        COUNT(DISTINCT user_id) as unique_users,
        SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) as null_users,
        AVG(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - timestamp))) as avg_lag_seconds
    FROM staging.events
    GROUP BY 1
    HAVING COUNT(*) < 100  -- Alert on low volume
        OR avg_lag_seconds > 3600;  -- Alert on high latency
    
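A small sketch of how this quality view might be wired into a scheduled check; the view and thresholds come from the SQL above, while the database file name and the alerting (a plain print plus an exception) are placeholders:

    import duckdb

    def run_quality_checks(db_path: str = "analytics.db") -> None:
        """Fail loudly if any hour violates the volume or latency thresholds."""
        con = duckdb.connect(db_path)
        problems = con.execute(
            "SELECT * FROM staging.data_quality ORDER BY check_hour"
        ).fetchdf()
        if not problems.empty:
            # Swap the print for your alerting channel of choice
            print(f"{len(problems)} problematic hour(s) detected:")
            print(problems.to_string(index=False))
            raise RuntimeError("Data quality checks failed")

    if __name__ == "__main__":
        run_quality_checks()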

2. Advanced Transformations: Beyond Basic SQL

Building a Customer 360 View

    -- The analytics schema needs to exist before the views below
    CREATE SCHEMA IF NOT EXISTS analytics;
    
    CREATE OR REPLACE VIEW analytics.customer_360 AS
    WITH user_first_touch AS (
        SELECT 
            user_id,
            MIN(timestamp) as first_seen,
            -- arg_min picks the referrer from each user's earliest event
            arg_min(referrer, timestamp) as acquisition_source
        FROM staging.events
        WHERE event_type = 'page_view'
        GROUP BY 1
    ),
    user_engagement AS (
        SELECT 
            user_id,
            COUNT(DISTINCT DATE_TRUNC('day', timestamp)) as active_days,
            COUNT(*) as total_events,
            SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) as lifetime_value,
            MAX(timestamp) as last_seen
        FROM staging.events
        GROUP BY 1
    ),
    user_preferences AS (
        SELECT 
            user_id,
            ARRAY_AGG(DISTINCT category) as interested_categories,
            mode(platform) as preferred_platform
        FROM staging.events
        WHERE event_type IN ('view_item', 'purchase')
        GROUP BY 1
    )
    SELECT 
        u.user_id,
        uft.first_seen,
        uft.acquisition_source,
        ue.active_days,
        ue.total_events,
        ue.lifetime_value,
        ue.last_seen,
        up.interested_categories,
        up.preferred_platform,
        CASE 
            WHEN ue.lifetime_value > 1000 THEN 'VIP'
            WHEN ue.lifetime_value > 500 THEN 'Regular'
            ELSE 'New'
        END as customer_segment
    FROM users u
    LEFT JOIN user_first_touch uft USING (user_id)
    LEFT JOIN user_engagement ue USING (user_id)
    LEFT JOIN user_preferences up USING (user_id);
    
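As a usage example (assuming the database file is named analytics.db), the view can be pulled straight into pandas for segment-level reporting:

    import duckdb

    con = duckdb.connect("analytics.db")

    # Segment sizes and revenue, straight from the customer_360 view above
    segments = con.execute("""
        SELECT
            customer_segment,
            COUNT(*) AS customers,
            SUM(lifetime_value) AS total_ltv
        FROM analytics.customer_360
        GROUP BY 1
        ORDER BY total_ltv DESC
    """).fetchdf()
    print(segments)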

3. Performance Optimization Techniques

Materialized Views with Refresh Strategy

    -- Create a materialized view for heavy computations
    CREATE OR REPLACE TABLE analytics.daily_metrics AS
    SELECT 
        DATE_TRUNC('day', timestamp) as date,
        COUNT(DISTINCT user_id) as dau,
        COUNT(DISTINCT session_id) as sessions,
        SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) as revenue,
        AVG(CASE WHEN event_type = 'purchase' THEN amount END) as aov,
        COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) as paying_users
    FROM staging.events
    GROUP BY 1;
    
    -- Refresh strategy
    -- DuckDB has no stored procedures, so run the refresh as a short
    -- transaction from whatever drives your pipeline (cron, Python, dbt, ...)
    BEGIN TRANSACTION;
    
    -- Back up the current data so a bad refresh can be rolled back
    CREATE OR REPLACE TABLE analytics.daily_metrics_backup AS 
    SELECT * FROM analytics.daily_metrics;
    
    -- Rebuild the trailing seven days with fresh data
    DELETE FROM analytics.daily_metrics
    WHERE date >= CURRENT_DATE - INTERVAL '7 days';
    
    INSERT INTO analytics.daily_metrics
    SELECT 
        DATE_TRUNC('day', timestamp) as date,
        COUNT(DISTINCT user_id) as dau,
        COUNT(DISTINCT session_id) as sessions,
        SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) as revenue,
        AVG(CASE WHEN event_type = 'purchase' THEN amount END) as aov,
        COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) as paying_users
    FROM staging.events
    WHERE timestamp >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY 1;
    
    COMMIT;
    
    -- Validate the refresh from the caller: if today's row count is
    -- suspiciously low, restore analytics.daily_metrics_backup
    SELECT COUNT(*) AS todays_rows
    FROM analytics.daily_metrics
    WHERE date = CURRENT_DATE;
    
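Because the validate-and-restore logic cannot live inside DuckDB itself, here is a hedged sketch of driving the whole refresh from Python; the table names and the 100-row threshold mirror the SQL above, while the database file name is an assumption:

    import duckdb

    def refresh_daily_metrics(db_path: str = "analytics.db") -> None:
        con = duckdb.connect(db_path)

        # Snapshot current data, then rebuild the trailing seven-day window
        con.execute("""
            CREATE OR REPLACE TABLE analytics.daily_metrics_backup AS
            SELECT * FROM analytics.daily_metrics
        """)
        con.execute("""
            DELETE FROM analytics.daily_metrics
            WHERE date >= CURRENT_DATE - INTERVAL '7 days'
        """)
        con.execute("""
            INSERT INTO analytics.daily_metrics
            SELECT
                DATE_TRUNC('day', timestamp),
                COUNT(DISTINCT user_id),
                COUNT(DISTINCT session_id),
                SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END),
                AVG(CASE WHEN event_type = 'purchase' THEN amount END),
                COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END)
            FROM staging.events
            WHERE timestamp >= CURRENT_DATE - INTERVAL '7 days'
            GROUP BY 1
        """)

        # Validate: restore the backup if today's row count looks anomalous
        todays_rows = con.execute(
            "SELECT COUNT(*) FROM analytics.daily_metrics WHERE date = CURRENT_DATE"
        ).fetchone()[0]
        if todays_rows < 100:
            con.execute("""
                CREATE OR REPLACE TABLE analytics.daily_metrics AS
                SELECT * FROM analytics.daily_metrics_backup
            """)
            raise RuntimeError("Refresh failed: low record count, backup restored")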

4. Integration with External Tools

Python Analytics Integration

    import duckdb
    import pandas as pd
    import plotly.express as px
    
    class AnalyticsEngine:
        def __init__(self, db_path: str = 'analytics.db'):
            self.conn = duckdb.connect(db_path)
    
        def get_cohort_analysis(self, 
                               start_date: str, 
                               end_date: str) -> pd.DataFrame:
            query = """
            WITH first_purchase AS (
                SELECT 
                    user_id,
                    DATE_TRUNC('month', MIN(timestamp)) as cohort_month
                FROM staging.events
                WHERE event_type = 'purchase'
                GROUP BY 1
            ),
            monthly_activity AS (
                SELECT 
                    fp.user_id,
                    fp.cohort_month,
                    DATE_TRUNC('month', e.timestamp) as activity_month,
                    SUM(CASE WHEN e.event_type = 'purchase' 
                        THEN e.amount ELSE 0 END) as revenue
                FROM first_purchase fp
                JOIN staging.events e USING (user_id)
                GROUP BY 1, 2, 3
            )
            SELECT 
                cohort_month,
                activity_month,
                COUNT(DISTINCT user_id) as active_users,
                SUM(revenue) as revenue,
                date_diff('month', cohort_month, activity_month) as months_since_first_purchase
            FROM monthly_activity
            WHERE cohort_month BETWEEN ? AND ?
            GROUP BY 1, 2
            ORDER BY 1, 2;
            """
            return self.conn.execute(query, 
                                   [start_date, end_date]).fetchdf()
    
        def visualize_cohorts(self, df: pd.DataFrame) -> None:
            retention_matrix = df.pivot(index='cohort_month',
                                      columns='months_since_first_purchase',
                                      values='active_users')
    
            fig = px.imshow(retention_matrix,
                           labels=dict(x="Months Since First Purchase",
                                     y="Cohort Month",
                                     color="Active Users"),
                           title="Customer Cohort Analysis")
            fig.show()
    
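Putting the class to work is then only a few lines; the dates below are placeholders:

    # Example usage of the AnalyticsEngine defined above
    engine = AnalyticsEngine("analytics.db")
    cohorts = engine.get_cohort_analysis("2024-01-01", "2024-06-30")
    print(cohorts.head())
    engine.visualize_cohorts(cohorts)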

Real-World Implementation: E-commerce Analytics Platform

Let’s put it all together with a complete e-commerce analytics solution:

    -- Core tables
    CREATE TABLE products (
        product_id VARCHAR PRIMARY KEY,
        name VARCHAR,
        category VARCHAR,
        price DECIMAL(10,2),
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    
    CREATE TABLE users (
        user_id VARCHAR PRIMARY KEY,
        email VARCHAR,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        country VARCHAR,
        acquisition_source VARCHAR
    );
    
    CREATE TABLE orders (
        order_id VARCHAR PRIMARY KEY,
        user_id VARCHAR REFERENCES users(user_id),
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        status VARCHAR,
        total_amount DECIMAL(10,2)
    );
    
    CREATE TABLE order_items (
        order_id VARCHAR REFERENCES orders(order_id),
        product_id VARCHAR REFERENCES products(product_id),
        quantity INTEGER,
        price_at_time DECIMAL(10,2),
        PRIMARY KEY (order_id, product_id)
    );
    
    -- Analytics Views
    CREATE OR REPLACE VIEW analytics.product_performance AS
    WITH completed_order_items AS (
        -- Restrict to completed orders before aggregating, so products
        -- with no completed sales still appear with zero metrics
        SELECT oi.*, o.user_id
        FROM order_items oi
        JOIN orders o USING (order_id)
        WHERE o.status = 'completed'
    ),
    product_metrics AS (
        SELECT 
            p.product_id,
            p.name,
            p.category,
            COUNT(DISTINCT coi.order_id) as orders,
            COALESCE(SUM(coi.quantity), 0) as units_sold,
            COALESCE(SUM(coi.quantity * coi.price_at_time), 0) as revenue,
            COUNT(DISTINCT coi.user_id) as unique_customers
        FROM products p
        LEFT JOIN completed_order_items coi USING (product_id)
        GROUP BY 1, 2, 3
    )
    SELECT 
        *,
        revenue / NULLIF(units_sold, 0) as avg_selling_price,
        revenue / NULLIF(unique_customers, 0) as revenue_per_customer
    FROM product_metrics;
    
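One way to populate these tables, sketched here with hypothetical CSV export paths, is to bulk-load them with DuckDB’s COPY and then query the view directly:

    import duckdb

    con = duckdb.connect("analytics.db")

    # Bulk-load CSV exports into the core tables (paths are illustrative);
    # load order respects the foreign-key dependencies
    for table in ["products", "users", "orders", "order_items"]:
        con.execute(f"COPY {table} FROM 'exports/{table}.csv' (FORMAT CSV, HEADER true)")

    # Top ten products by revenue, from the analytics view defined above
    top_products = con.execute("""
        SELECT name, category, revenue, units_sold
        FROM analytics.product_performance
        ORDER BY revenue DESC
        LIMIT 10
    """).fetchdf()
    print(top_products)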

Performance Monitoring and Optimization

Query Performance Analysis

    -- DuckDB has no built-in query history table, so this view assumes you
    -- log each query's text and runtime into your own analytics.query_log
    -- table (one way to populate it is sketched after this block)
    CREATE OR REPLACE VIEW analytics.query_performance AS
    WITH query_patterns AS (
        SELECT 
            regexp_replace(query, '\d+', '#', 'g') as pattern,
            COUNT(*) as execution_count,
            AVG(execution_time) as avg_execution_time,
            MAX(execution_time) as max_execution_time,
            MIN(execution_time) as min_execution_time
        FROM analytics.query_log
        GROUP BY 1
    ),
    slow_queries AS (
        SELECT 
            pattern,
            execution_count,
            avg_execution_time,
            RANK() OVER (ORDER BY avg_execution_time DESC) as slowness_rank
        FROM query_patterns
        WHERE execution_count > 10
    )
    SELECT *
    FROM slow_queries
    WHERE slowness_rank <= 10;
    
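Nothing populates that log automatically, so here is one hedged way to do it: a small wrapper that times each query and appends it to analytics.query_log (the table name and columns are assumptions carried over from the view above):

    import time
    import duckdb

    con = duckdb.connect("analytics.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS analytics.query_log (
            query VARCHAR,
            execution_time DOUBLE,  -- seconds
            executed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    def timed_query(sql, params=None):
        """Run a query, record its wall-clock runtime, and return the cursor."""
        start = time.perf_counter()
        result = con.execute(sql, params) if params else con.execute(sql)
        elapsed = time.perf_counter() - start
        con.execute(
            "INSERT INTO analytics.query_log (query, execution_time) VALUES (?, ?)",
            [sql, elapsed],
        )
        return result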

Conclusion: The Future of Data Stacks

After implementing this stack across multiple use cases, here’s what I’ve learned:

1. Simplicity Wins: DuckDB’s approach of “batteries included” means fewer moving parts and less operational overhead.
2. Performance at Scale: For many workloads, DuckDB on a single machine outperforms distributed systems on datasets up to several terabytes.
3. Cost Efficiency: By eliminating the need for separate storage and compute layers, costs can be reduced by 80% or more.
4. Developer Experience: The ability to query data directly from files and integrate seamlessly with Python/R makes development significantly faster.

Next Steps and Resources

To implement this stack:

1. Start with a proof of concept using your smallest dataset
2. Build incrementally, adding complexity only when needed
3. Monitor performance and optimize queries
4. Scale by partitioning data into Parquet files if a single database becomes unwieldy (a sketch follows below)

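For step 4, a sketch of one partitioning approach: export large tables to date-partitioned Parquet with DuckDB’s COPY ... PARTITION_BY and query only the partitions you need (the directory layout is illustrative):

    import duckdb

    con = duckdb.connect("analytics.db")

    # Write events out as Hive-style date partitions
    con.execute("""
        COPY (SELECT *, CAST(timestamp AS DATE) AS event_date FROM staging.events)
        TO 'warehouse/events' (FORMAT PARQUET, PARTITION_BY (event_date))
    """)

    # Later queries can prune partitions by filtering on the partition column
    recent = con.execute("""
        SELECT COUNT(*) AS recent_events
        FROM read_parquet('warehouse/events/*/*.parquet', hive_partitioning = true)
        WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
    """).fetchdf()
    print(recent)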