CES Hackathon Project

A reproducible election-forecasting pipeline that benchmarks theory-driven voter-level regression against simple survey aggregation, then improves performance via Pareto-optimal feature selection, calibration, and cross-year validation.

7 min read · Feb 9, 2026 · Machine Learning


Why this project (and why “poll averaging” is not the whole story)

Poll or survey aggregation is a strong baseline because it is transparent and resistant to overfitting. But aggregation also collapses the mechanism: it summarizes what respondents report, not how combinations of ideology, retrospective evaluations, trust, and issue salience jointly map into vote choice.


This project frames election forecasting as a comparative research design:

Baseline: simple aggregation of survey signals

Alternative: voter-level models (regression) that encode voting-behavior theory

Improvement layer: auditable, multi-objective feature selection that trades off accuracy and complexity

Central question: Can modeling individual voter reasoning produce better forecasts than simply averaging poll responses?


Data and scope

The pipeline is explicitly built around major public election-study infrastructures:

ANES (US): 2012, 2016, 2020

CES (Canada): 2011, 2015, 2019, 2021, 2025

The Canadian setting is particularly demanding because forecasting is not just “two-party swing.”


The workflow supports multinomial vote choice (six-party classification: CPC/LPC/NDP/Bloc/Green/PPC) and treats survey weighting as first-class rather than an afterthought.

Methodology: a two-step comparative design

Step 1: Direct replication (theory-driven)

Step 1 follows Camatarri (2024) using manual feature selection motivated by voting-behavior theory. The repository explicitly enumerates feature families such as:


  • ideology (left-right self-placement)

  • retrospective economic evaluations (national economy, personal finances)

  • trust in government

  • issue salience (climate, housing, immigration)

  • demographics (age, gender, education, ethnicity)

  • province fixed effects
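The theory-driven feature families above can be organized as an explicit, auditable map from family to survey columns. A minimal sketch (the column names here are hypothetical; actual ANES/CES variable names differ by wave):

```python
# Hypothetical feature-family map; real CES/ANES column names vary by survey wave.
FEATURE_FAMILIES = {
    "ideology": ["lr_self_placement"],
    "retrospective_economy": ["econ_national", "econ_personal"],
    "trust": ["trust_government"],
    "issue_salience": ["issue_climate", "issue_housing", "issue_immigration"],
    "demographics": ["age", "gender", "education", "ethnicity"],
    "geography": ["province"],  # later expanded into fixed-effect dummies
}

def select_features(families, spec=FEATURE_FAMILIES):
    """Flatten the chosen theory-driven families into a flat column list."""
    return [col for fam in families for col in spec[fam]]
```

Keeping selection as data rather than code makes the Step 1 benchmark easy to inspect and to compare against any automated subset chosen later.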


Models include:


  • Multinomial logistic regression (multi-party vote choice)

  • Bayesian logistic regression (MCMC: 4 chains, 2000 iterations)
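A minimal sketch of the voter-level multinomial model on synthetic data, using scikit-learn (an assumed stack; the project's actual implementation may differ). The key detail is that survey weights flow into fitting via `sample_weight`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy stand-in for voter-level data: 4 predictors, 6 party labels.
X = rng.normal(size=(500, 4))          # e.g. ideology, economy, trust, age
y = rng.integers(0, 6, size=500)       # CPC/LPC/NDP/Bloc/Green/PPC codes
w = rng.uniform(0.5, 2.0, size=500)    # survey weights

# Multinomial logit; sample_weight carries the survey weights through fitting.
clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
probs = clf.predict_proba(X)           # per-voter vote-choice probabilities
```

Weighted fitting matters here because unweighted estimates would reflect the sample's composition rather than the electorate's.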



Aggregation baseline: weighted-mean threshold classification with survey weights, which serves as the simple poll/survey-aggregation benchmark.
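The aggregation baseline is simple enough to state in a few lines. A sketch of weighted-mean threshold classification (function names are illustrative, not from the repository):

```python
import numpy as np

def weighted_share(responses, weights):
    """Weighted mean of a binary survey signal (e.g. 'intends to vote for party X')."""
    responses = np.asarray(responses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * responses) / np.sum(weights))

def threshold_classify(responses, weights, threshold=0.5):
    """Baseline prediction: the outcome occurs if the weighted share crosses the threshold."""
    return weighted_share(responses, weights) >= threshold
```

Its appeal is exactly what the introduction notes: transparency and resistance to overfitting, at the cost of discarding the voter-level mechanism.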


The point is not that theory-driven selection is “best,” but that it provides a transparent benchmark against which any automated selection must justify itself.

Step 2: PPV–Pareto enhancement (data-driven, constraint-aware)

Step 2 replaces manual selection with an automated, multi-objective search: candidate feature subsets are evaluated on predictive accuracy and model complexity, and only Pareto-optimal subsets (those that cannot improve one objective without worsening the other) are retained. On top of selection, this layer adds probability calibration and cross-year validation (training on one election wave, testing on another), so any gain over the theory-driven benchmark of Step 1 is auditable rather than an artifact of overfitting a single survey.
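The Pareto screening over feature subsets can be sketched as a dominance filter on (complexity, error) pairs, where both objectives are minimized. This is an illustrative implementation, not the repository's code:

```python
def pareto_front(candidates):
    """candidates: list of (n_features, error) pairs, one per fitted feature subset.
    Keep subsets that no other subset beats on both complexity and error."""
    front = []
    for c in candidates:
        dominated = any(
            o != c and o[0] <= c[0] and o[1] <= c[1]
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front)
```

The returned front is the menu of defensible trade-offs: a forecaster can then pick the point whose accuracy gain over the Step 1 benchmark justifies its added complexity.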
