tabular_embedding_01 — learned embeddings for structured data

Short note on tabular_embedding_01, a 2025 project (May–Oct, private). ~107 R-B commits. The repo does not have a public README; this is the shape from the inside.

The problem

A lot of financial data is tabular — features are a mix of numeric, categorical, and ordered types, often with missing values and column meanings that shift over time. Tree-based models handle that gracefully; neural models don’t, without work. The question was whether a learned embedding layer per column, trained jointly with a downstream head, could close the gap — and in particular whether a pretrained tabular transformer (TabPFN-shaped) could serve as a base that you fine-tune on your specific dataset.

What I built

  • Per-column embedding modules sized to the column’s type and cardinality, with a shared downstream head.
  • A fine-tune loop for a TabPFN-shaped backbone against a finance-flavoured downstream task.
  • Baselines: XGBoost, a plain MLP with one-hot encoding, and a version with the learned embeddings but no pretrained backbone. Each run puts the three baseline numbers next to each other.
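
To make "sized to the column's type and cardinality" concrete, here is a minimal pure-Python sketch, not the repo's code. Everything named in it is an assumption for illustration: the ColumnSpec and ColumnEmbedding classes, the fixed width of 4 for numeric columns, the cap of 32, and the sizing rule itself (the 1.6 * n^0.56 rule of thumb popularised by fastai). Ordered columns are treated like categorical ones here, and index 0 of each lookup table is reserved for missing values.

```python
import random
from dataclasses import dataclass


@dataclass
class ColumnSpec:
    name: str
    kind: str             # "numeric", "categorical", or "ordinal"
    cardinality: int = 0  # number of distinct levels; unused for numeric


def embedding_width(spec, max_width=32):
    """Pick an embedding width from the column's type and cardinality."""
    if spec.kind == "numeric":
        return 4  # hypothetical fixed width for scalar columns
    # Rule of thumb: grow sublinearly with cardinality, capped at max_width.
    return min(max_width, max(2, round(1.6 * spec.cardinality ** 0.56)))


class ColumnEmbedding:
    """One small embedding module per column (randomly initialised, untrained)."""

    def __init__(self, spec, rng):
        self.spec = spec
        self.width = embedding_width(spec)

        def init():
            return [rng.gauss(0.0, 0.1) for _ in range(self.width)]

        if spec.kind == "numeric":
            # value -> value * scale + bias, plus a separate vector for "missing".
            self.scale, self.bias, self.missing = init(), init(), init()
        else:
            # Row 0 is reserved for missing values; levels occupy rows 1..cardinality.
            self.table = [init() for _ in range(spec.cardinality + 1)]

    def __call__(self, value):
        if self.spec.kind == "numeric":
            if value is None:
                return list(self.missing)
            return [value * s + b for s, b in zip(self.scale, self.bias)]
        # value is an integer code in [0, cardinality), or None if missing.
        index = 0 if value is None else value + 1
        return list(self.table[index])


def embed_row(embeddings, row):
    """Concatenate the per-column vectors into the input for a shared head."""
    out = []
    for emb in embeddings:
        out.extend(emb(row.get(emb.spec.name)))
    return out


specs = [
    ColumnSpec("amount", "numeric"),
    ColumnSpec("sector", "categorical", cardinality=11),
    ColumnSpec("rating", "ordinal", cardinality=5),
]
rng = random.Random(0)
embeddings = [ColumnEmbedding(spec, rng) for spec in specs]
vec = embed_row(embeddings, {"amount": 1.5, "sector": 3, "rating": None})
```

In a real training setup the tables and scale/bias vectors would be trainable parameters updated jointly with the head; the sketch only shows the shapes and the missing-value handling.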

Where it plateaued

  • On the finance-flavoured tasks I threw at it, XGBoost stayed the cheapest honest baseline. The learned embeddings closed some of the gap; the pretrained backbone sometimes helped, sometimes didn’t, depending on how close the downstream distribution was to the pretraining distribution.
  • The embedding layer is a real feature where you want to combine tabular data with text or with a time series: the joint model has a place to put the tabular side without flattening it. On pure tabular tasks, it’s often a net negative versus a good tree.
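
One way to read "a place to put the tabular side without flattening it" is a token sequence: each column becomes one token at the model width, and those tokens sit next to the text (or time-series) tokens. The sketch below is an assumption about that wiring, not a description of what the repo does. The helper names, the per-column projection matrices, the model width of 8, and the placeholder vectors are all hypothetical.

```python
import random


def project(vector, matrix):
    """Multiply a length-n vector by an n x d matrix given as a list of rows."""
    d = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(d)]


def make_matrix(n, d, rng):
    """Random n x d projection; a stand-in for a learned linear layer."""
    return [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(n)]


def build_joint_sequence(column_vectors, projections, text_tokens):
    """One token per tabular column, followed by the text tokens."""
    tabular = [
        project(vec, projections[name]) for name, vec in column_vectors.items()
    ]
    return tabular + text_tokens


rng = random.Random(0)
d_model = 8  # hypothetical shared width of the joint model

# Placeholder per-column embeddings of different widths (values are made up).
column_vectors = {
    "amount": [0.3] * 4,
    "sector": [0.1] * 6,
    "rating": [0.2] * 4,
}
projections = {
    name: make_matrix(len(vec), d_model, rng)
    for name, vec in column_vectors.items()
}
# Stand-in for five text-token embeddings from some text encoder.
text_tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(5)]

sequence = build_joint_sequence(column_vectors, projections, text_tokens)
# sequence holds 3 column tokens followed by 5 text tokens, each of width d_model.
```

The point of the design is that each column keeps its own slot in the sequence, so an attention-style model can relate "sector" to a phrase in the text directly, which a single flattened feature vector does not allow.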

What I’d claim

  • A working training pipeline, with baselines, on several datasets. Not production; not a research result worth writing up on its own.
  • The project was useful as a forcing function for thinking about the finance_lab input shapes later — specifically, what a “feature column” is once you’ve committed to neural models.

What I wouldn’t claim

  • State-of-the-art on a benchmark. I didn’t run Open Tabular Benchmark; the comparisons were against internal baselines.
  • That a pretrained backbone is a silver bullet. For my datasets, it was one tool among several, not a clear winner.

Private repo.

Written on July 10, 2025