Technical Methodology

ML-Powered Value-Weighted Dollar-Cost Averaging

A hybrid machine-learning system combining a gradient-boosted ensemble, a recurrent attention network, and transformer-based sentiment analysis to set the monthly allocation amount within a strict, budget-neutral dollar-cost-averaging framework.

01

System Architecture

The system combines three complementary model components, each producing a distinct market view that is fused into a single allocation signal. These are model components, not autonomous agents; the dashboard's advisor personas are a separate, rule-based presentation layer.

Technical Agent

Three-model stacking ensemble — XGBoost[1], LightGBM, and CatBoost as base learners with a Ridge meta-learner for final prediction.

XGBoost, LightGBM, CatBoost, Ridge
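The stacking step can be sketched without the boosting libraries themselves. The example below assumes the three base learners have already produced out-of-fold predictions (one column each) and fits only the Ridge meta-learner in closed form. The function names, the `alpha` value, and the synthetic data are illustrative assumptions, not the system's actual configuration.

```python
import numpy as np

def fit_ridge_meta(base_preds, y, alpha=1.0):
    """Fit ridge weights on out-of-fold base-model predictions.

    base_preds: array of shape (n_samples, n_models), one column per base learner.
    y:          array of shape (n_samples,), the prediction target.
    Returns (weights, intercept).
    """
    # Centre inputs and target so the intercept is not penalised.
    p_mean = base_preds.mean(axis=0)
    y_mean = y.mean()
    P = base_preds - p_mean
    # Closed-form ridge solution: (P^T P + alpha * I) w = P^T (y - y_mean)
    gram = P.T @ P + alpha * np.eye(P.shape[1])
    w = np.linalg.solve(gram, P.T @ (y - y_mean))
    intercept = y_mean - p_mean @ w
    return w, intercept

def predict_meta(base_preds, w, intercept):
    """Combine base-model predictions into the stacked prediction."""
    return base_preds @ w + intercept

# Synthetic stand-in for out-of-fold predictions: target plus model-specific noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
oof = np.column_stack([y + rng.normal(scale=s, size=200) for s in (0.5, 0.7, 0.9)])
w, b = fit_ridge_meta(oof, y, alpha=1.0)
stacked = predict_meta(oof, w, b)
```

In this toy setup the meta-learner gives the least noisy column the largest weight, which is the behaviour a stacking layer is meant to provide.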

Temporal Agent

PyTorch GRU with a scaled dot-product attention head, capturing sequential dependencies in price dynamics over variable-length lookback windows.

GRU, Attention, PyTorch
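To illustrate the attention head, here is a minimal NumPy sketch of scaled dot-product attention pooling over a sequence of hidden states. It assumes the query is the last hidden state and that the GRU outputs are already available as an array; the shapes and the name `attention_pool` are hypothetical, and the production model is described as PyTorch rather than NumPy.

```python
import numpy as np

def attention_pool(hidden, query):
    """Pool a sequence of hidden states into one context vector.

    hidden: array of shape (T, d), e.g. GRU outputs for T time steps.
    query:  array of shape (d,); here the last hidden state (an assumption).
    Returns (context, weights), with weights summing to 1 over time steps.
    """
    d = hidden.shape[1]
    scores = hidden @ query / np.sqrt(d)   # scaled dot products, shape (T,)
    scores = scores - scores.max()         # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    context = weights @ hidden             # weighted sum of states, shape (d,)
    return context, weights

rng = np.random.default_rng(0)
states = rng.normal(size=(30, 16))         # hypothetical 30-step lookback, 16 hidden units
context, weights = attention_pool(states, states[-1])
```

Because the softmax runs over however many time steps are supplied, the same function handles lookback windows of different lengths.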

NLP Agent

ProsusAI/FinBERT[2] transformer fine-tuned on financial text. Produces headline-level sentiment with confidence-weighted aggregation.

FinBERT, Transformers
02

Feature Engineering

33 technical features are computed from OHLCV data and external signals, organized into eight categories.

Momentum: RSI, ROC-5, ROC-10, Stochastic %K, Stochastic %D, Williams %R
Trend: SMA short / long, MA crossover, trend slope 5d / 20d, trend direction
Volatility: ATR%, Bollinger %B, Bollinger width, Garman-Klass volatility, volatility squeeze, volatility regime
Volume: Relative volume, volume shock
Time: Hour / day-of-week cyclical encoding (sin & cos pairs)
Pattern: Candlestick pattern confidence, chart pattern confidence
Statistical: Hurst exponent, rolling Sharpe 20d, efficiency ratio, return skewness, return kurtosis
Sentiment: VIX z-score, Fear & Greed Index
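Two of the listed features can be sketched in a few lines. The example assumes a simple-average RSI (Cutler's variant, not necessarily the smoothing the system uses) with a 14-period window, and the standard sin/cos encoding for cyclical time features. The function names and the default period are illustrative assumptions.

```python
import math

def rsi_simple(closes, period=14):
    """RSI over the last `period` price changes, using simple averages."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    # Pair each close with the next one to get the last `period` changes.
    changes = [b - a for a, b in zip(closes[-period - 1:-1], closes[-period:])]
    gain = sum(c for c in changes if c > 0) / period
    loss = sum(-c for c in changes if c < 0) / period
    if gain == 0 and loss == 0:
        return 50.0          # flat prices: neutral reading (a convention chosen here)
    if loss == 0:
        return 100.0
    rs = gain / loss
    return 100.0 - 100.0 / (1.0 + rs)

def cyclical(value, period):
    """Encode a periodic value (hour of day, day of week) as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

hour_features = cyclical(23, 24)   # hour 23 lands next to hour 0 on the unit circle
dow_features = cyclical(2, 7)      # day-of-week index 2 in a 7-day cycle
```

The sin/cos pair keeps adjacent times adjacent in feature space, which a raw integer hour (23 versus 0) would not.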
03

Signal Composition

The deployed daily scanner produces its signal as a weighted composite of five components: mean-reversion, momentum, trend, ML, and sentiment views. This composite is a hand-set heuristic prototype. The thesis's validated allocation differs (standardized causal signals, weights tuned on held-out data, a dedicated forward-cheapness ML model, and budget neutrality via a self-funded reserve), and it is the validated allocation, not this prototype, that produces the results in the Validation section.

30%: Trailing 20-day return (mean reversion; buy more after drawdowns)
25%: RSI signal (oversold / overbought oscillator)
20%: Price vs. SMA (trend-following confirmation)
15%: ML ensemble score (stacking ensemble predicted return)
10%: VIX / sentiment fear signal (contrarian fear indicator)
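The weighting above amounts to a dot product. The sketch below uses the five stated weights and assumes each component has already been normalised to the range [-1, 1], with positive values meaning "buy more"; that normalisation, the sign convention, and the key names are assumptions made for illustration.

```python
WEIGHTS = {
    "mean_reversion": 0.30,   # trailing 20-day return
    "rsi": 0.25,              # RSI signal
    "trend": 0.20,            # price vs. SMA
    "ml": 0.15,               # ML ensemble score
    "sentiment": 0.10,        # VIX / sentiment fear signal
}

def composite_signal(components):
    """Blend component scores (assumed to lie in [-1, 1]) into one composite."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    # Clip each component so a single outlier cannot dominate the blend.
    return sum(
        WEIGHTS[name] * max(-1.0, min(1.0, components[name]))
        for name in WEIGHTS
    )

example = composite_signal(
    {"mean_reversion": 0.8, "rsi": 0.5, "trend": -0.2, "ml": 0.1, "sentiment": 0.6}
)
```

Since the weights sum to 1, the composite stays in the same [-1, 1] range as its inputs.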
04

Validation

Purged Walk-Forward Cross-Validation

Following López de Prado[5], training and test folds are separated by a purge gap equal to the maximum label horizon, preventing information leakage from overlapping samples. An embargo period further removes observations whose features could span the train/test boundary.
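A minimal sketch of such a split generator follows. It assumes an expanding training window, equal-sized test folds, and that both the purge and the embargo are applied as a gap between the end of training and the start of testing. That placement is a simplification chosen here, and the function name and parameters are hypothetical.

```python
def purged_walk_forward(n_samples, n_folds, purge, embargo=0):
    """Yield (train_range, test_range) pairs for walk-forward validation.

    purge:   number of samples dropped before each test fold, set to the
             maximum label horizon so no training label overlaps the test fold.
    embargo: extra samples dropped at the same boundary (simplifying assumption).
    """
    fold = n_samples // (n_folds + 1)
    if fold <= purge + embargo:
        raise ValueError("folds are too short for the requested purge and embargo")
    for k in range(1, n_folds + 1):
        test_start = k * fold
        # The last fold absorbs any remainder samples.
        test_end = n_samples if k == n_folds else test_start + fold
        train_end = test_start - purge - embargo
        yield range(0, train_end), range(test_start, test_end)

for train, test in purged_walk_forward(100, 4, purge=5, embargo=2):
    print(f"train 0..{train[-1]}  gap  test {test[0]}..{test[-1]}")
```

Every training index precedes every test index, so each fold is evaluated only on data that lies after what the model was fitted on.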

Causal, Budget-Neutral Evaluation

The allocation backtest is strictly causal: both strategies buy on the same fixed day and budget neutrality is enforced by a self-funded reserve, with no in-window day selection or full-horizon renormalization (either of which would inflate a naive backtest toward 3%). Allocation weights are tuned on a set of symbols and rolling windows that is disjoint from the reported instruments, which are scored once with frozen weights.
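One way to read "budget neutrality via a self-funded reserve" is sketched below. The assumed mechanism is that each month's unspent budget accumulates in a reserve, and any above-budget purchase can draw only on what that reserve already holds, so cumulative spending never exceeds cumulative budget at any point in time. The mechanism, the multiplier input, and the function name are assumptions for illustration, not the thesis's exact rule.

```python
def allocate(budget, multipliers):
    """Turn per-month signal multipliers into budget-neutral purchase amounts.

    budget:      fixed monthly contribution.
    multipliers: desired spend as a fraction of the monthly budget
                 (e.g. 0.5 = buy half, 1.5 = buy 50% more), known on the buy day.
    Returns (spends, final_reserve).
    """
    reserve = 0.0
    spends = []
    for m in multipliers:
        available = budget + reserve              # this month's cash plus past savings
        desired = max(budget * m, 0.0)
        spend = min(desired, available)           # cannot borrow against future months
        reserve = available - spend
        spends.append(spend)
    return spends, reserve

spends, reserve = allocate(100.0, [0.5, 1.5, 1.5, 1.0])
# Month 3 wants 150 but the reserve is empty, so it is capped at 100.
```

Because each month's decision uses only the current multiplier and the reserve built so far, the rule needs no knowledge of later months and no renormalisation over the full horizon.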

Results

On held-out assets (2024 onward), the system delivers a consistent reduction in average cost basis of about 0.5%, positive on seven of eight instruments at roughly 100% capital deployment. A cross-asset extension that reallocates the budget across a basket shows a larger, statistically significant effect: a mean cross-sectional active return with t = 2.22 (p = 0.028) over 128 monthly observations.

05

Explainability

Per-prediction feature attributions are computed using SHAP[3] TreeExplainer on the XGBoost base model. Each signal displayed in the dashboard includes a waterfall chart decomposing the allocation score into additive contributions from individual features — enabling the investor to understand why the model recommends increasing or decreasing allocation on any given day.
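The additive property behind a waterfall chart can be shown without the SHAP library. The sketch below swaps in a plain linear model, for which the SHAP value of feature i (assuming independent features) is w_i * (x_i - E[x_i]); it is not TreeExplainer and not the system's XGBoost model. The weights, feature count, and data are made up for illustration.

```python
import numpy as np

def linear_attributions(weights, x, background):
    """Additive attributions for a linear model f(x) = weights . x + bias.

    background: array of shape (n, d) used to estimate the expected input.
    Returns one contribution per feature; together they sum to
    f(x) minus the average prediction over the background set.
    """
    baseline_x = background.mean(axis=0)
    return weights * (x - baseline_x)

rng = np.random.default_rng(1)
w_lin = np.array([0.6, -0.3, 0.1])          # hypothetical model weights
bias = 0.05
background = rng.normal(size=(500, 3))      # hypothetical reference samples
x = np.array([1.2, -0.4, 0.3])              # the prediction being explained

contrib = linear_attributions(w_lin, x, background)
prediction = w_lin @ x + bias
base_value = float((background @ w_lin + bias).mean())
# Waterfall: start at base_value and add the contributions one by one.
```

A waterfall chart draws exactly this sum: the base value plus each feature's contribution arrives at the model's output for that day.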

06

NLP Pipeline

Financial sentiment is extracted from live news via a three-stage pipeline. Up to 50 articles are analyzed per symbol per day.

  1. Collection — Google News RSS feed is scraped for headlines matching tracked asset tickers and sector keywords.
  2. Classification — Each headline is scored by FinBERT[2] (Araci, 2019), a BERT model pre-trained on financial corpora, producing a 3-class probability distribution (positive / neutral / negative).
  3. Aggregation — Per-headline scores are aggregated using confidence-weighted averaging, where the weight is the model's softmax confidence. This suppresses low-conviction predictions.
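The aggregation stage can be sketched as follows, taking the classifier's three-class probabilities as given. The signed score (positive probability minus negative probability) is an assumption made for illustration, as are the function and constant names. The confidence weight is the top-class softmax probability, and the 50-article cap comes from the description above.

```python
from itertools import islice

MAX_ARTICLES = 50   # per symbol per day, as stated above

def aggregate_sentiment(headline_probs):
    """Confidence-weighted mean sentiment over a day's headlines.

    headline_probs: iterable of (p_positive, p_neutral, p_negative) tuples.
    Returns a value in [-1, 1]; 0.0 when there are no headlines.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for p_pos, p_neu, p_neg in islice(headline_probs, MAX_ARTICLES):
        confidence = max(p_pos, p_neu, p_neg)   # softmax confidence of the top class
        score = p_pos - p_neg                   # assumed signed score per headline
        weighted_sum += confidence * score
        total_weight += confidence
    return weighted_sum / total_weight if total_weight > 0 else 0.0

daily = aggregate_sentiment([
    (0.90, 0.05, 0.05),   # confident positive headline
    (0.40, 0.35, 0.25),   # low-conviction headline, down-weighted
])
```

In this example the confident headline pulls the aggregate above the unweighted mean of the two scores, which is the suppression effect described in step 3.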
07

Portfolio Optimization

Multi-asset allocation follows Modern Portfolio Theory[4] (Markowitz, 1952). The efficient frontier is computed via SciPy SLSQP constrained optimization with per-asset-class bounds (equities 40-80%, bonds 10-40%, alternatives 0-20%). Three pre-configured risk profiles — conservative, balanced, and aggressive — correspond to target annualized volatilities of 8%, 12%, and 18%, respectively.
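One point on such a frontier can be sketched with `scipy.optimize.minimize`. The example maximises expected return subject to full investment, the asset-class bounds given above, and a volatility cap equal to a profile's target. The expected returns, volatilities, and correlations are invented inputs, and treating each asset class as a single asset is a simplification.

```python
import numpy as np
from scipy.optimize import minimize

ASSET_CLASSES = ["equities", "bonds", "alternatives"]
BOUNDS = [(0.40, 0.80), (0.10, 0.40), (0.00, 0.20)]   # bounds from the text

def max_return_weights(mu, cov, target_vol, bounds=BOUNDS):
    """Maximise expected return with volatility capped at target_vol."""
    constraints = [
        {"type": "eq", "fun": lambda w: np.sum(w) - 1.0},                  # fully invested
        {"type": "ineq", "fun": lambda w: target_vol ** 2 - w @ cov @ w},  # variance cap
    ]
    # Start inside the bounds: lower bounds plus a proportional share of the slack.
    lo = np.array([b[0] for b in bounds])
    slack = np.array([b[1] - b[0] for b in bounds])
    x0 = lo + slack * (1.0 - lo.sum()) / slack.sum()
    res = minimize(lambda w: -(mu @ w), x0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    if not res.success:
        raise RuntimeError(res.message)
    return res.x

# Hypothetical annualised inputs, for illustration only.
mu = np.array([0.07, 0.03, 0.05])
vols = np.array([0.16, 0.06, 0.15])
corr = np.array([[1.0, -0.1, 0.1],
                 [-0.1, 1.0, 0.0],
                 [0.1, 0.0, 1.0]])
cov = corr * np.outer(vols, vols)

balanced = max_return_weights(mu, cov, target_vol=0.12)   # "balanced" profile
```

Repeating the call with 0.08 and 0.18 would give the conservative and aggressive counterparts, provided the bounds make each target attainable for the inputs used.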

08

Multi-Asset Portfolio & Currency Conversion

The system supports a multi-asset portfolio across 10 tracked instruments: VWCE.DE, SPY, QQQ, EFA, EEM, GLD, TLT, BTC-USD, AAPL, XOM. Portfolio value is computed by fetching the latest price for each invested symbol and converting to the user's chosen display currency using live exchange rates.

Exchange Rate API

Currency conversion uses the fawazahmed0/currency-api (free, daily-updated, 150+ currencies) with a 1-hour cache. Supported display currencies include USD, EUR, GBP, CHF, RON, JPY, CAD, and AUD. Investments are stored in the asset's native trading currency and converted on the fly for display.
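The caching behaviour can be sketched independently of the rate provider. In the example below the fetch function is injected, so no endpoint or response format is assumed; `RateCache`, `convert`, and the injected clock are hypothetical names introduced for illustration.

```python
import time

class RateCache:
    """Cache exchange rates for a fixed time-to-live (1 hour by default)."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch          # callable (base, quote) -> rate, supplied by the caller
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without waiting
        self._store = {}             # (base, quote) -> (timestamp, rate)

    def get(self, base, quote):
        key = (base, quote)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # still fresh: no network call
        rate = self._fetch(base, quote)
        self._store[key] = (now, rate)
        return rate

def convert(amount, native, display, cache):
    """Convert an amount from the asset's native currency to the display currency."""
    if native == display:
        return amount
    return amount * cache.get(native, display)
```

Keeping stored amounts in the native currency and converting only at display time means a stale or missing rate affects what is shown, not what is recorded.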

Alpha Calculation

The dashboard's simulated alpha compares Smart DCA against fixed-schedule DCA over the period currently shown. Both buy on the same fixed day each month and invest the same total; Smart DCA only tilts the monthly amount by that day's allocation signal (causal, no look-ahead). Because it runs on the displayed window of recent daily signals from the deployed scanner, it is a short, noisy figure, not the thesis's long-run, tuned-model result. Strategy-level risk metrics (Sharpe ratio, maximum drawdown) are computed from monthly portfolio returns.
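The two risk metrics can be computed from a list of monthly returns as sketched below. The annualisation factor of sqrt(12), the sample standard deviation, and the zero default risk-free rate are conventional choices assumed here rather than details confirmed by the text.

```python
import math

def sharpe_ratio(monthly_returns, risk_free_monthly=0.0):
    """Annualised Sharpe ratio from monthly returns (sample standard deviation)."""
    excess = [r - risk_free_monthly for r in monthly_returns]
    n = len(excess)
    if n < 2:
        raise ValueError("need at least two monthly returns")
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)
    if variance == 0:
        raise ValueError("returns have zero variance")
    return mean / math.sqrt(variance) * math.sqrt(12)

def max_drawdown(monthly_returns):
    """Largest peak-to-trough decline of the compounded equity curve (<= 0)."""
    equity = peak = 1.0
    worst = 0.0
    for r in monthly_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst

returns = [0.10, -0.20, 0.05]           # illustrative monthly returns
dd = max_drawdown(returns)              # the 20% fall from the month-1 peak
sr = sharpe_ratio([0.01, 0.03, 0.02])
```

With only a handful of monthly observations both figures are unstable, which matches the caveat that the dashboard number is short and noisy.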

REF

References

[1] Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[2] Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv:1908.10063.

[3] Lundberg, S. & Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30.

[4] Markowitz, H. (1952). Portfolio Selection. The Journal of Finance, 7(1), 77-91.

[5] López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.