For one reason or another, I'm often asked to show off software or ML/AI projects that I've worked on in previous roles. Doing so is usually tricky because I don't work at those jobs anymore, and even if I did, the code likely wouldn't be releasable to the public. What's more, since many of my biggest projects were deployed years ago, there are now much better methods, techniques, and modern tooling for these same problems.

So I had an idea: why not develop an end-to-end simulation engine that recreates the technical problem space of my previous roles without the proprietary elements? Then I could safely rebuild my projects as modern applications that demonstrate the types of technical solutions I've built over the years.

This post is the result of that idea. The companion repository contains full solutions to two end-to-end time series ML scenarios drawn from real technical problems that I previously solved in industry. This post covers the full lifecycle, from synthetic data generation through model training to a production-style serving layer.

The first scenario implements a freight forecasting system for computer networking hardware return (RMA) logistics, leveraging GRU recurrent networks and Facebook Prophet. The second is an unsupervised clustering pipeline for network security events which uses K-means with Latent Semantic Analysis. Both are problems that I actually worked on and deployed solutions for in the real world.

The Simulation Engine

Most ML problems don't have a well-defined statistical structure that can be recreated synthetically. Luckily, certain types of time series problems are built different! A well-scoped freight forecasting problem has weekly seasonality, regional variation, categorical hierarchies, and trend components. A network security clustering problem may have multimodal event types with distinct numerical and textual signatures. If you embed these patterns into a synthetic data pipeline with the right distributions and correlations, you can use the same modeling approaches you would in a real-world ML environment, facing similar design decisions and technical trade-offs, all without touching proprietary data.

The pipeline runs as a sequence of numbered scripts: generate data, explore, train, evaluate, export, serve. Each script consumes the output of the previous one. The data generation system is config-driven. Everything flows from a single settings.yaml, with an extensive preset system for creating dataset variants:

# Default 2-year dataset (~262K RMA records, 30K network events)
python scripts/01_generate_data.py

# High-volatility variant to stress-test models
python scripts/01_generate_data.py -n noisy -p noisy

# Short history to test model degradation (90 days, 5K events)
python scripts/01_generate_data.py -n small -p small

# Then train on any variant
python scripts/03_train_rma_model.py -d noisy

Every downstream script accepts a --dataset-name flag, so you can train and then evaluate against any dataset variant without modifying the model code. This turns "how does the model handle noisier data?" from a thought experiment into a one-liner. Configuration presets live in config/presets/ and override only the data generation parameters, keeping everything else in the codebase stable.
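
To give a flavor of what a preset looks like, here is a hypothetical config/presets/noisy.yaml. The key names below are assumptions rather than the repo's actual schema; the point is that a preset only touches data generation parameters:

# config/presets/noisy.yaml (illustrative -- key names are assumptions)
data_generation:
  rma:
    noise_std_multiplier: 2.5        # widen day-to-day volume noise
    weekend_drop_range: [0.4, 0.8]   # more erratic weekend behavior
  network_events:
    cluster_overlap: 0.3             # blur boundaries between event types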

The RMA time series dataset generator produces roughly 262,000 records across 730 days, 5 regions, and 15 SKU categories by default. Embedded patterns include weekly seasonality (weekends drop to 50-70% of weekday volume), end-of-month spikes (1.3x multiplier), regional growth trends (APAC at +8% annually), and SKU-specific weight distributions ranging from 0.1 kg for RAM modules to 8 kg for chassis units.

On the other hand, the network events generator produces 30,000 datapoints across 30 days with six distinct behavioral clusters, each with characteristic duration, byte transfer, port usage, and log message vocabulary patterns.
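
The real generators are more involved, but the core mechanism is multiplicative structure layered onto a base volume. A minimal sketch of the idea for the RMA side (function and parameter names here are illustrative, not the repo's actual API):

import numpy as np
import pandas as pd

def daily_weight(date: pd.Timestamp, base_kg: float, rng: np.random.Generator) -> float:
    """Illustrative sketch: compose the embedded patterns onto a base daily volume."""
    weight = base_kg
    if date.dayofweek >= 5:                  # weekends drop to 50-70% of weekday volume
        weight *= rng.uniform(0.5, 0.7)
    if date.day >= 28:                       # end-of-month spike
        weight *= 1.3
    years_elapsed = (date - pd.Timestamp("2023-01-01")).days / 365
    weight *= 1.08 ** years_elapsed          # e.g. a regional growth trend of +8% annually
    return weight * rng.lognormal(0.0, 0.1)  # heavy-tailed daily noise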

The goal is not to offer synthetic data as a perfect replacement for production data. However, constructing the datasets this way forced me to make many of the same modeling decisions I made in my previous real-world roles: temporal train/test splits, sliding window construction, embedding dimension selection, K selection for clustering, TF-IDF tuning for features, early stopping criteria, model serialization, inference API design. All of these require design and engineering trade-offs, regardless of the data's origin. The simulation engine lets me demonstrate those decisions transparently.

Scenario 1: RMA Freight Forecasting

The real-world problem: a global hardware manufacturer processes many thousands of RMA (Return Material Authorization) requests daily. Components ship back from customers worldwide, and logistics needs to allocate freight capacity to partners with as much granularity and lead time as possible. This company historically forecast freight capacity (in total kg) once per month: one static number applied uniformly across 30 days, even though actual daily volume often swung between 200 kg and 600 kg. Such a crude approach resulted in a roughly 180 kg mean absolute error per day, translating to over $75,000 a month in emergency shipping fees and/or wasted pre-booked capacity.

My goal was twofold:
1. Increase forecast frequency from monthly to daily.
2. Reduce the forecasting error by at least 35%.

Preprocessing

To start, I implemented an RMAPreprocessor class, which performs the full transformation from raw records to model-ready PyTorch tensors. It aggregates raw records to daily regional totals, encodes categorical variables, and scales numerical features. An early, critical design decision is how to split the data.

The _temporal_split method specifically divides the aggregated time series chronologically according to the train_ratio parameter: in practice, the first 70% of days become training data, the next 15% validation, and the final 15% test:

def _temporal_split(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    n = len(df)
    train_end = int(n * self.config.train_ratio)
    val_end = int(n * (self.config.train_ratio + self.config.val_ratio))

    train_df = df.iloc[:train_end].copy()
    val_df = df.iloc[train_end:val_end].copy()
    test_df = df.iloc[val_end:].copy()

    return train_df, val_df, test_df

Why not split randomly? Because random splitting would allow the model to train on December data and test on June data, using future information to predict the past. This is the most common form of data leakage in time series, and it produces artificially optimistic metrics that fail to generalize in production. Temporal splits prevent this: the model never sees future data during training. For the same reason, encoders and scalers are fit on the training split only, then applied to validation and test.
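
In code, that discipline amounts to nothing more than being careful about which split you call fit on. A minimal sketch of the pattern (not the exact RMAPreprocessor internals; train_df/val_df/test_df come from the split above and numerical_cols is the list of numerical feature columns):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_df[numerical_cols] = scaler.fit_transform(train_df[numerical_cols])  # fit on train only
val_df[numerical_cols] = scaler.transform(val_df[numerical_cols])          # transform, never refit
test_df[numerical_cols] = scaler.transform(test_df[numerical_cols])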

Once the data is split and transformed, the RMAPreprocessor hands each split off to a separate RMATimeSeriesDataset object, a PyTorch Dataset subclass that converts the flat DataFrame into overlapping input/target sequences. This is where the sliding window logic lives. The dataset slides a window one day at a time through the split, taking 30 consecutive days of features as input and the next 7 days of shipping weights as the prediction target:

# Create sliding windows (min_length = sequence_length + prediction_horizon)
for i in range(n - min_length + 1):
    seq_df = df.iloc[i:i + sequence_length]
    target_df = df.iloc[i + sequence_length:i + sequence_length + prediction_horizon]

    cat_features = seq_df[categorical_cols].values.astype(np.int64)
    num_features = seq_df[numerical_cols].values.astype(np.float32)
    targets = target_df[target_col].values.astype(np.float32)

    self.sequences.append({
        "categorical": cat_features,
        "numerical": num_features,
        "target": targets,
    })

From 2 years of data, this generates roughly 700 training sequences. Each one is an independent sample that a forecasting model can learn from. The sliding window framing is itself a significant upgrade over the old monthly brute-force process: instead of one static forecast per month, we can produce a fresh 7-day forecast every day, with no ML at all!

GRU Architecture

Even though sliding window forecasts are a major improvement, modern ML techniques can take us further. I chose Gated Recurrent Units over LSTMs for time series forecasting here because, for 30-day sequences, GRUs perform comparably with fewer parameters. If you want the full intuition for how recurrent gates work, Christopher Olah's "Understanding LSTMs" is the canonical explanation. It covers vanilla RNNs, LSTMs, and GRUs in a single post. The short version: the GRU's two-gate mechanism (update and reset) is sufficient for moderate-length temporal dependencies, and the simpler architecture is more efficient to train and easier to explain in a technical walkthrough.

During my development process, the model evolved through three versions, all in gru_forecaster.py. V1 is the baseline: numerical features only through a 2-layer stacked GRU with dropout for regularization:

class GRUForecasterV1(nn.Module):
    def __init__(self, n_numerical_features, hidden_size=64, num_layers=2,
                 dropout=0.2, prediction_horizon=7):
        super().__init__()
        self.gru = nn.GRU(
            input_size=n_numerical_features,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, prediction_horizon)

    def forward(self, numerical, categorical=None):
        output, h_n = self.gru(numerical)
        final_hidden = h_n[-1]  # (batch, hidden_size)
        out = self.dropout(final_hidden)
        return self.fc(out)     # (batch, prediction_horizon)
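
As a quick sanity check on shapes, V1 can be exercised with a dummy batch: 30 days of history in, 7 days of forecasts out. (The 4 numerical features here are just for illustration.)

import torch

model = GRUForecasterV1(n_numerical_features=4)
dummy_batch = torch.randn(16, 30, 4)   # (batch, sequence_length=30, numerical features)
forecast = model(dummy_batch)
print(forecast.shape)                  # torch.Size([16, 7]) -- one prediction per horizon day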

V2 adds categorical embeddings. Instead of one-hot encoding 5 regions and 15 SKU types into 20+ sparse columns, embeddings learn dense representations. This is the same idea behind word2vec, applied to categorical features. Region becomes a 4-dimensional vector, SKU category becomes 8-dimensional. For example, the model discovers that "CPU" and "GPU" have similar shipping patterns and places their embeddings close together:

self.embeddings = nn.ModuleDict({
    name: nn.Embedding(
        num_embeddings=vocab_size,
        embedding_dim=embedding_dims.get(name, 4),
    )
    for name, vocab_size in vocab_sizes.items()
})

These get concatenated with numerical features before entering the GRU, expanding input from 4 to 20 dimensions.

V3 adds production refinements that matter for training stability. The forward pass shows the key additions: layer normalization after the GRU, a residual connection from the averaged embeddings (the same skip-connection idea from ResNet, applied here to ensure embedding information reaches the output), and a 2-layer MLP output:

def forward(self, numerical, categorical):
    # ... embeddings computed as in V2 ...

    # Average embeddings for residual connection
    avg_embeddings = all_embeddings.mean(dim=1)  # (batch, total_embed_dim)

    # GRU forward pass
    combined = torch.cat([numerical, all_embeddings], dim=-1)
    output, h_n = self.gru(combined)
    final_hidden = h_n[-1]

    # Layer norm and residual connection
    normalized = self.layer_norm(final_hidden)
    combined_output = torch.cat([normalized, avg_embeddings], dim=-1)

    # Output MLP
    out = self.dropout(combined_output)
    out = self.relu(self.fc1(out))
    out = self.dropout(out)
    return self.fc2(out)

The layer normalization stabilizes training when combining heterogeneous features. The residual connection ensures categorical information flows directly to the prediction head. This is useful when certain SKU categories consistently have higher weights regardless of temporal patterns.

The training loop in rma_trainer.py uses MSE loss, Adam at 1e-3, and three stabilization mechanisms that matter for recurrent networks:

# Learning rate scheduler - halve LR when validation stalls
# (see https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html)
self.scheduler = ReduceLROnPlateau(
    self.optimizer, mode="min", factor=0.5, patience=5,
)

# Early stopping - stop when validation loss hasn't improved for 10 epochs
# (see https://pytorch.org/tutorials/beginner/early_stopping_tutorial.html)
self.early_stopping = EarlyStopping(patience=10, mode="min")

# Inside training loop: gradient clipping for RNN stability
torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_grad_norm)

Gradient clipping at norm 1.0 prevents the exploding gradient problem that compounds across the 30 time steps of the recurrent sequence. Without it, training destabilizes periodically.

Prophet as an Alternative

Neural networks aren't always the right tool, and even when they work, they can be more trouble than they're worth to deploy and maintain. As an alternative, the ProphetForecaster wraps Meta's Prophet library with a fundamentally different approach: fit an explicit decomposition of trend, seasonality, and regressor effects. The original paper by Taylor and Letham is worth reading to understand the full methodology.

I used multiplicative seasonality because shipping volumes grow over time. A 10% Tuesday spike on a baseline of 500 kg (adding 50 kg) is different from a 10% spike on 1000 kg (adding 100 kg). Multiplicative captures this proportionality. Hyndman & Athanasopoulos give a clear treatment of when to use additive vs multiplicative decomposition.

Prophet V2 adds three exogenous regressors. The data preparation in rma_preprocessor.py shows how the preprocessor builds the Prophet-format DataFrame with regressor columns:

def prepare_for_prophet(self, df, include_regressors=True):
    # Aggregate to daily level (global across all regions)
    daily_df = df.groupby("date").agg({
        "shipping_weight_kg": "sum",
        "request_urgency": "mean",
        "failure_rate_pct": "mean",
    }).reset_index()

    prophet_df = pd.DataFrame({
        "ds": pd.to_datetime(daily_df["date"]),
        "y": daily_df["shipping_weight_kg"],
    })

    if include_regressors:
        # Month-end feature: the data embeds a 1.3x EOM spike
        # that pure seasonality won't capture
        prophet_df["is_month_end"] = (
            prophet_df["ds"].dt.day >= 28
        ).astype(float)
        prophet_df["failure_rate_pct"] = daily_df["failure_rate_pct"].values
        prophet_df["avg_urgency"] = daily_df["request_urgency"].values

    # Split chronologically with the same train/val/test ratios as the GRU pipeline
    n = len(prophet_df)
    train_end = int(n * self.config.train_ratio)
    val_end = int(n * (self.config.train_ratio + self.config.val_ratio))
    return (
        prophet_df.iloc[:train_end],
        prophet_df.iloc[train_end:val_end],
        prophet_df.iloc[val_end:],
    )
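
With the DataFrame in Prophet's ds/y format, the V2 model setup looks roughly like the following. This is a sketch of the configuration described in this section, not the exact ProphetForecaster code:

from prophet import Prophet

model = Prophet(
    seasonality_mode="multiplicative",  # spikes scale with the growing baseline
    weekly_seasonality=True,
    yearly_seasonality=True,
    interval_width=0.80,                # 80% uncertainty intervals out of the box
)
for name in ["is_month_end", "failure_rate_pct", "avg_urgency"]:
    model.add_regressor(name)

model.fit(train_df)                     # expects columns: ds, y, plus the regressors
forecast = model.predict(test_df)       # future frames must also carry the regressor columns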

One key advantage of Prophet is interpretability. When a stakeholder asks "why is next Tuesday's forecast so high?", Prophet can answer: "because it's end of month, which adds 15%, and Tuesdays historically have 10% higher volume." A GRU cannot do this. Prophet also provides built-in uncertainty quantification with 80% confidence intervals out of the box. The key disadvantage is accuracy: Prophet trails GRU by about 6-8 kg MAE on this problem.

Results

Approach       Frequency    MAE (kg)    vs Old Monthly
Old Monthly    1x/month     ~180        --
Daily Naive    Daily        114         -37%
Prophet V2     Daily        ~68         -62%
GRU V3         Daily        62          -66%

Two things happened simultaneously. First, moving from monthly to daily forecasting, even with a naive "repeat yesterday" approach, cuts error by 37%. That's free improvement from simply forecasting more often. Second, the GRU model adds another 46% improvement on top of the daily naive baseline by learning weekly patterns, end-of-month spikes, and regional trends.

The R-squared is only about 0.09. In forecasting, that's not the metric that matters. Hyndman and Athanasopoulos's standard forecasting textbook doesn't even include R-squared among its recommended accuracy measures, because it captures in-sample fit rather than out-of-sample prediction quality. Daily shipping volumes are inherently volatile: customer behavior, supply chain disruptions, and weather introduce variation that no model should try to overfit. For context, Gu, Kelly, and Xiu (2020) reported out-of-sample R-squared values of 0.3--0.4% for neural network stock return predictions, and that result was published in The Review of Financial Studies as a major advance in the field. MAE is what matters for operational decisions, and a 66% reduction in daily forecast error translates to meaningful savings in emergency shipments and capacity utilization.

Each model version tells an interesting, and unique, story about diminishing returns. V2 barely improved over V1 because in this synthetic dataset, aggregating by region already captures most of the categorical signal. In messier real-world data with more complex category interactions, you might see bigger gains from embeddings. V3's production refinements (layer norm, residual connections) added stability more than accuracy, which was exactly their purpose.

Scenario 2: Network Event Clustering

The real problem: an enterprise security operations center generates millions of network events daily: web requests, database queries, authentication attempts, system maintenance. Buried in this flood are signals of suspicious activity. No one can manually review millions of events, and new attack patterns emerge constantly. We needed an approach that discovers structure without labeled training data, the domain of unsupervised learning.

The Multimodal Challenge

Network events are inherently multimodal. The numbers tell part of the story: a 3-second connection transferring 50 MB is suspicious. But the log message adds context: "bulk download initiated by admin" is very different from "DNS tunneling pattern detected." By combining numerical features with text analysis, we get richer clustering than either approach alone.

The NetworkPreprocessor runs two parallel feature extraction paths. Numerical features go through log transformation (duration and bytes are heavy-tailed) and standard scaling. Text features go through a TF-IDF to LSA pipeline. The text cleaning step is small but important:

# Normalizes raw log text: mask IPs and numbers, strip punctuation
def _clean_text(self, text: str) -> str:
    text = str(text).lower()
    text = re.sub(r'\d+\.\d+\.\d+\.\d+', 'IP_ADDR', text)
    text = re.sub(r'\b\d+\b', 'NUM', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

IP addresses get replaced with a placeholder token (they're handled as separate features), and numbers get normalized. This reduces vocabulary noise so TF-IDF can focus on the words that actually distinguish event types. If TF-IDF is new to you, the scikit-learn documentation gives a concise explanation of the weighting scheme.

The full text pipeline fits TF-IDF then reduces dimensionality with Latent Semantic Analysis (LSA, also called Latent Semantic Indexing). The original Deerwester et al. paper is foundational, but for a practical introduction, the scikit-learn decomposition docs cover the implementation directly:

self.tfidf = TfidfVectorizer(
    max_df=0.95, min_df=2,
    ngram_range=(1, 2),           # capture phrases like "port scan"
    max_features=1000,
    stop_words="english",
)
tfidf_matrix = self.tfidf.fit_transform(cleaned_texts)

self.svd = TruncatedSVD(n_components=20, random_state=42)
lsa_features = self.svd.fit_transform(tfidf_matrix)

The 20 LSA components are latent topics: directions in vocabulary space that capture co-occurring terms. Topic 0 might load heavily on "request", "api", "completed", "http" (web traffic vocabulary). Topic 1 might capture "failed", "authentication", "password", "blocked" (the authentication failure pattern). These topics emerge from the data. We didn't tell the algorithm what words to look for.
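
You can verify what each latent topic captured by looking at the highest-weighted vocabulary terms in each SVD component. A small helper along these lines (illustrative, not part of the repo):

def top_terms_per_topic(svd, tfidf, n_terms=6):
    """Print the highest-weighted vocabulary terms for each LSA component."""
    terms = tfidf.get_feature_names_out()
    for topic_idx, component in enumerate(svd.components_):
        top = component.argsort()[::-1][:n_terms]
        print(f"Topic {topic_idx}: {', '.join(terms[i] for i in top)}")

top_terms_per_topic(self.svd, self.tfidf)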

The preprocessor concatenates everything into a 25-dimensional feature matrix:

combined_features = np.hstack([numerical_scaled, text_features])

Five scaled numerical features plus 20 LSA components. All features are standardized so K-means doesn't weight them by raw magnitude.

K-Means with Automatic K Selection

The KMeansClusterer wraps scikit-learn's K-means (initialized with K-means++ for smarter centroid seeding) with automatic K selection. The elbow method finder uses a geometric approach: find the point with maximum perpendicular distance from the line connecting the first and last inertia values:

def _find_elbow_point(self, k_values, inertias):
    # Normalize both axes to [0, 1] so neither dominates the distance geometry
    k_arr = np.array(k_values, dtype=float)
    inertia_arr = np.array(inertias, dtype=float)
    k_norm = (k_arr - k_arr.min()) / (k_arr.max() - k_arr.min())
    inertia_norm = (inertia_arr - inertia_arr.min()) / (inertia_arr.max() - inertia_arr.min())

    # Line from first to last point
    p1 = np.array([k_norm[0], inertia_norm[0]])
    p2 = np.array([k_norm[-1], inertia_norm[-1]])

    # Find point with maximum perpendicular distance
    distances = []
    for i in range(len(k_values)):
        p = np.array([k_norm[i], inertia_norm[i]])
        d = np.abs(np.cross(p2 - p1, p1 - p)) / np.linalg.norm(p2 - p1)
        distances.append(d)

    return k_values[np.argmax(distances)]

The silhouette score provides a second opinion. It measures how similar each point is to its own cluster versus the nearest neighboring cluster, ranging from -1 (wrong cluster) to +1 (perfect match). If the silhouette score at the elbow's K trails the best silhouette score by more than 0.05, the clusterer defers to the silhouette choice. On this data, both methods agree on K=5.
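
Putting the two criteria together, the selection loop is conceptually something like this. It is a sketch of one plausible implementation of the tie-break described above, not the exact KMeansClusterer logic:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_values = list(range(2, 11))
inertias, silhouettes = [], []
for k in k_values:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(combined_features)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(combined_features, km.labels_))

elbow_k = self._find_elbow_point(k_values, inertias)
best_sil_k = k_values[int(np.argmax(silhouettes))]

# Defer to silhouette only when the elbow choice leaves a meaningful silhouette gap
if silhouettes[k_values.index(best_sil_k)] - silhouettes[k_values.index(elbow_k)] > 0.05:
    chosen_k = best_sil_k
else:
    chosen_k = elbow_k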

The synthetic data has 6 ground-truth clusters. The model merged two similar clusters, likely the normal web and normal database traffic, which share overlapping feature distributions. This is realistic. In production, you rarely recover exactly the "true" number of clusters. The clusterer also computes anomaly scores (normalized distance from each point to its assigned centroid), so events far from any cluster center can be flagged for investigation.
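
The anomaly scores themselves are cheap to compute once the model is fitted. A sketch, assuming a fitted scikit-learn KMeans (the repo's exact normalization may differ):

import numpy as np

# Distance from each event to its assigned (nearest) centroid
distances = kmeans.transform(combined_features).min(axis=1)

# Normalize to [0, 1] so scores are comparable across datasets
anomaly_scores = (distances - distances.min()) / (distances.max() - distances.min())
to_review = np.argsort(anomaly_scores)[::-1][:100]  # e.g. the 100 most anomalous events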

Evaluation

Metric               Value    Interpretation
Clusters found       5        vs 6 ground truth
Silhouette           0.19     Modest separation
Calinski-Harabasz    7,468    Good variance ratio
Davies-Bouldin       1.39     Moderate overlap

A silhouette of 0.19 doesn't mean the clustering is useless. Mixed feature types (numerical + text) create complex geometry, log-normal distributions cause within-cluster spread, and some cluster types genuinely overlap. Even with imperfect separation, the clustering reduces 30,000 events to a handful of actionable groups. Instead of reviewing everything, analysts focus on the ~3,000 events in suspicious clusters.

Because the data is synthetic, the ClusteringEvaluator can also compute external metrics, specifically Adjusted Rand Index and Normalized Mutual Information, that measure agreement with ground truth. These metrics aren't available in production (you won't have labels for real network events), but they validate that the pipeline is working as designed.
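
Computing those external metrics is a two-liner with scikit-learn, evaluated against the generator's ground-truth labels. The column name below is an assumption for illustration:

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# "true_cluster" is a hypothetical column carrying the generator's ground-truth labels
ari = adjusted_rand_score(events_df["true_cluster"], cluster_labels)
nmi = normalized_mutual_info_score(events_df["true_cluster"], cluster_labels)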

The Serving Layer

Both models are served behind a FastAPI application. The app definition captures the two-model architecture:

app = FastAPI(
    title="Time Series ML Demonstration API",
    description="API for RMA shipping weight forecasting and network event classification.",
    version="1.0.0",
)

@app.post("/api/v1/rma/forecast", response_model=RMAForecastResponse)
async def forecast_rma(request: RMAForecastRequest):
    engine = get_rma_engine()
    if engine is None:
        raise HTTPException(status_code=503, detail="RMA model not available")

    data_dicts = [point.dict() for point in request.historical_data]
    df = pd.DataFrame(data_dicts)

    result = engine.predict(df, return_confidence=True)
    # ... format predictions with confidence intervals ...

@app.post("/api/v1/network/classify", response_model=NetworkClassifyResponse)
async def classify_network_event(request: NetworkEventRequest):
    engine = get_clustering_engine()
    if engine is None:
        raise HTTPException(status_code=503, detail="Clustering model not available")

    result = engine.classify(pd.DataFrame([request.dict()]))
    # ... format classification with anomaly score ...

The key design choice here is that inference engines (mlops/inference.py) handle preprocessing internally. The RMAInferenceEngine rebuilds the model from saved configuration, loads weights, and applies the same encoding and scaling that was fitted during training:

class RMAInferenceEngine:
    def __init__(self, registry_path="outputs/models", version="v3"):
        self.registry = ModelRegistry(registry_path)
        artifacts = self.registry.load_rma_model(version)

        self.weights = artifacts["weights"]
        self.preprocessor_state = artifacts["preprocessor"]
        self.config = artifacts["config"]

        self.model = self._build_model()
        self.model.load_state_dict(self.weights)
        self.model.eval()

    def predict(self, data: pd.DataFrame, return_confidence=False):
        processed = self._preprocess(data)  # encode, scale, tensorize
        with torch.no_grad():
            predictions = self.model(processed["numerical"], processed["categorical"])
        return {"predictions": predictions.cpu().numpy(), ...}

A client sends raw feature values, not pre-scaled tensors. This is a small detail that matters enormously in production: if the client has to replicate your preprocessing pipeline, you have a synchronization problem that will eventually produce silent errors. The ModelRegistry stores preprocessor state (encoder mappings, scaler parameters) alongside model weights so they can never drift apart.
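
From the client's side, a forecast request is just raw historical records. Something like the following, where the exact field names are assumptions; the Pydantic schemas in the repo define the real request shape:

import requests

payload = {
    "historical_data": [
        {"date": "2024-06-01", "region": "APAC", "sku_category": "CPU",
         "shipping_weight_kg": 412.5, "request_urgency": 2.1, "failure_rate_pct": 0.8},
        # ... at least 30 days of history ...
    ]
}
resp = requests.post("http://localhost:8000/api/v1/rma/forecast", json=payload)
print(resp.json())  # 7-day forecast with confidence intervals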

Wrapping Up

These two solution pipelines cover different aspects of time series ML, but interestingly enough, many of the same engineering patterns repeat. Both pipelines start with a preprocessing class that owns the data transformation logic and enforces temporal integrity. Both produce model artifacts that bundle learned parameters (weights, centroids, scaler state) so that nothing can drift between training and inference. Both are served through a nearly identical FastAPI layer, where the client sends raw data and the inference engine handles everything else.

From a results perspective, the GRU forecasting pipeline reduced daily MAE from 180 kg to 62 kg, with the Prophet variant trailing close behind at 68 kg in exchange for full interpretability and built-in uncertainty intervals.

The clustering pipeline compressed 30,000 network events into 5 actionable groups with automatic K selection, combining numerical and text features through a TF-IDF/LSA pipeline that surfaces latent event structure without labeled training data. Neither result is perfect, and that's the point. A silhouette score of 0.19 and an R-squared of 0.09 reflect the inherent noise in these domains, not necessarily modeling failures. The metrics that matter for operations (MAE, anomaly scoring, cluster actionability) tell a different and more useful story.

The full repo is runnable end-to-end. Clone it, generate a dataset, train both models, and hit the API. The preset system makes it straightforward to stress-test against noisy or limited data variants.

As for what's next... stay tuned for more technical campfire stories from my ML/AI past! Cheers.


References and Further Reading

Recurrent networks:
- Christopher Olah, "Understanding LSTM Networks" (colah.github.io, 2015)

Architecture components:
- He, Zhang, Ren, and Sun (2016), "Deep Residual Learning for Image Recognition" (ResNet)
- Mikolov et al. (2013), "Efficient Estimation of Word Representations in Vector Space" (word2vec)

Forecasting:
- Taylor and Letham (2018), "Forecasting at Scale" (Prophet)
- Hyndman and Athanasopoulos, "Forecasting: Principles and Practice" (OTexts)
- Gu, Kelly, and Xiu (2020), "Empirical Asset Pricing via Machine Learning," The Review of Financial Studies

Clustering and text analysis:
- Deerwester et al. (1990), "Indexing by Latent Semantic Analysis"
- scikit-learn documentation on TF-IDF vectorization and TruncatedSVD (LSA)


Cover image by Shubham Dhage on Unsplash.