CES Hackathon Project
A reproducible election-forecasting pipeline that benchmarks theory-driven voter-level regression against simple survey aggregation, then improves performance via Pareto-optimal feature selection, calibration, and cross-year validation.

7 min read · Published Feb 9, 2026 · Machine Learning
Why this project (and why “poll averaging” is not the whole story)
Poll or survey aggregation is a strong baseline because it is transparent and resistant to overfitting. But aggregation also collapses the mechanism: it summarizes what respondents report, not how combinations of ideology, retrospective evaluations, trust, and issue salience jointly map into vote choice.
This project frames election forecasting as a comparative research design:
Baseline: simple aggregation of survey signals
Alternative: voter-level models (regression) that encode voting-behavior theory
Improvement layer: auditable, multi-objective feature selection that trades off accuracy and complexity
Central question: Can modeling individual voter reasoning produce better forecasts than simply averaging poll responses?
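The baseline side of this comparison can be sketched in a few lines: a survey-weighted aggregation of stated vote intentions. The respondent records and the `vote_intent` / `weight` field names here are hypothetical stand-ins, not the project's actual schema.

```python
# Minimal sketch of the aggregation baseline: weighted average of stated
# vote intentions. Field names ("vote_intent", "weight") are illustrative.
from collections import defaultdict

def weighted_vote_shares(respondents):
    """Aggregate stated vote intentions into survey-weighted party shares."""
    totals = defaultdict(float)
    weight_sum = 0.0
    for r in respondents:
        totals[r["vote_intent"]] += r["weight"]
        weight_sum += r["weight"]
    return {party: w / weight_sum for party, w in totals.items()}

sample = [
    {"vote_intent": "LPC", "weight": 1.2},
    {"vote_intent": "CPC", "weight": 0.8},
    {"vote_intent": "LPC", "weight": 1.0},
]
shares = weighted_vote_shares(sample)  # LPC ≈ 0.733, CPC ≈ 0.267
```

Everything the baseline "knows" is in the marginal shares; the voter-level models below exist to test whether the joint structure it throws away carries forecasting signal.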
Data and scope
The pipeline is explicitly built around major public election-study infrastructures:
ANES (US): 2012, 2016, 2020
CES (Canada): 2011, 2015, 2019, 2021, 2025
The Canadian setting is particularly demanding because forecasting is not just “two-party swing.”
The workflow supports multinomial vote choice (six-party classification: CPC/LPC/NDP/Bloc/Green/PPC) and treats survey weighting as first-class rather than an afterthought.
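Treating weights as first-class means they enter the model's loss, not just the final tallies. A minimal sketch, using plain NumPy gradient descent on a survey-weighted multinomial cross-entropy (the toy data, three-class encoding, and hyperparameters are illustrative, not the project's):

```python
# Sketch: survey-weighted multinomial logistic regression, with weights
# scaling each respondent's contribution to the cross-entropy loss.
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_weighted_multinomial(X, y, w, n_classes, lr=0.5, steps=1000):
    """Gradient descent on sum_i w_i * cross_entropy_i (normalized)."""
    W = np.zeros((X.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                    # one-hot targets
    for _ in range(steps):
        P = softmax(X @ W)
        W -= lr * (X.T @ ((P - Y) * w[:, None])) / w.sum()
    return W

# Toy data: three "parties" ordered along a single attitude scale.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])          # intercept + scale position
y = np.digitize(x, [-0.5, 0.5])                 # classes 0, 1, 2
w = rng.uniform(0.5, 1.5, size=200)             # survey weights
W = fit_weighted_multinomial(X, y, w, n_classes=3)
pred = softmax(X @ W).argmax(axis=1)
```

The same structure extends to the six-party CES case by widening `n_classes` and the feature matrix; in practice a library implementation that accepts per-row sample weights would replace the hand-rolled loop.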
Methodology: a two-step comparative design
Step 1: Direct replication (theory-driven)
Step 1 follows Camatarri (2024) using manual feature selection motivated by voting-behavior theory. The repository explicitly enumerates feature families such as:
ideology (left-right self-placement)
retrospective economic evaluations (national economy, personal finances)
trust in government
issue salience (climate, housing, immigration)
demographics (age, gender, education, ethnicity)
province fixed effects
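The feature families above can be made explicit as a mapping from family to candidate survey columns, which is what keeps Step 1's manual selection auditable. The column names below are hypothetical stand-ins for the actual ANES/CES variable codes.

```python
# Theory-driven feature families as an explicit, auditable mapping.
# Column names are illustrative placeholders, not real survey codes.
FEATURE_FAMILIES = {
    "ideology": ["left_right_self_placement"],
    "retrospective_economy": ["national_economy_retro", "personal_finances_retro"],
    "trust": ["trust_in_government"],
    "issue_salience": ["salience_climate", "salience_housing", "salience_immigration"],
    "demographics": ["age", "gender", "education", "ethnicity"],
    "geography": ["province"],  # entered as fixed effects (one dummy per province)
}

def selected_columns(families, selection):
    """Flatten a chosen subset of families into a model column list."""
    return [col for fam in selection for col in families[fam]]
```

Any later automated selection can then be reported in the same vocabulary: which families it kept, which it dropped.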
Models include:
Multinomial logistic regression (multi-party vote choice)
Bayesian logistic regression (MCMC: 4 chains, 2000 iterations)
Aggregation baseline: weighted-mean threshold classification using survey weights, serving as the simple poll/survey-aggregation comparator.
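For a two-way contest, the weighted-mean threshold baseline reduces to a few lines: take the survey-weighted mean of a binary support indicator and classify by whether it crosses 0.5. The inputs and threshold below are illustrative.

```python
# Sketch of weighted-mean threshold classification: classify a contest by
# whether weighted mean support crosses the threshold. Inputs are toy values.
def threshold_forecast(support, weights, threshold=0.5):
    """Return (predicted class, weighted mean support)."""
    wmean = sum(s * w for s, w in zip(support, weights)) / sum(weights)
    return int(wmean > threshold), wmean

pred, share = threshold_forecast([1, 0, 1, 1], [1.0, 2.0, 0.5, 0.5])
# weighted mean = (1.0 + 0 + 0.5 + 0.5) / 4.0 = 0.5 → pred = 0
```

Note how the weights alone flip the call: an unweighted mean of the same responses would be 0.75.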
The point is not that theory-driven selection is “best,” but that it provides a transparent benchmark against which any automated selection must justify itself.
Step 2: PPV–Pareto enhancement (data-driven, constraint-aware)
Models include:
Multinomial logistic regression (multi-party vote choice)
Bayesian logistic regression (MCMC: 4 chains, 2000 iterations)
Aggregation baseline: weighted mean threshold classification with survey weights, compared against simple poll/survey aggregation.
The point is not that theory-driven selection is “best,” but that it provides a transparent benchmark against which any automated selection must justify itself.
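The accuracy-versus-complexity trade-off can be sketched as a Pareto filter over candidate feature sets: keep a candidate only if no other candidate is at least as accurate with no more features (and strictly better on one of the two). The candidate names and scores below are made up for illustration.

```python
# Pareto filter over candidate feature sets, trading off accuracy
# (maximize) against feature count (minimize). Scores are illustrative.
def pareto_front(candidates):
    """candidates: list of (name, accuracy, n_features) tuples."""
    front = []
    for name, acc, k in candidates:
        dominated = any(
            (a >= acc and m <= k) and (a > acc or m < k)
            for _, a, m in candidates
        )
        if not dominated:
            front.append((name, acc, k))
    return front

cands = [("full", 0.71, 40), ("theory", 0.70, 12),
         ("tiny", 0.64, 3), ("bloat", 0.69, 45)]
front = pareto_front(cands)
# → [("full", 0.71, 40), ("theory", 0.70, 12), ("tiny", 0.64, 3)]
```

Here "bloat" is dropped because "full" is both more accurate and smaller, while the remaining three each represent a defensible accuracy/complexity compromise that an analyst can inspect.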