Kaggle Solution Showcase

Trading at the Close1

This is a very good example of how a Kaggle competition can involve very complex data. As described in the competition description, the last 10 minutes of a trading day are often much more volatile than the rest of the day.

As mentioned in the write-up of the competition, “For 1760 users (including 83 in the top 100!), this was their first competition.”

This includes me!

Remember, the competition is scored on MAE (mean absolute error). Let’s create a dashboard (obviously) and start exploring. Click on any pic to use it yourself.
Anything above 10 is immediately discarded. Even so, we see clustering towards the left.

The winning value is 5.403. Amazingly, we can exclude everything above 5.65. This means that over 3000 entries (out of 4K) are very competitive! And we see three interesting peaks!

Now would be a good time to report my actual score. Remember, the overall range is 5.4 to 5.65. There is a clear peak at 5.47, and then two other peaks to the right.

Can you guess my score?

Yup, you guessed it. 5.4744!!!

Good for top 1000!

So how was I able to score so well? And why were there so many submissions here? Well, someone was nice enough to share a submission template. Anybody who followed it could have made the top 1000!!!

What’s very interesting is that there’s a distribution on each side of this peak. To me, this suggests that most people submitted the template as-is, like I did. However, it seems that a large group of people tweaked it, some for the better and some for the worse.

The other two peaks are probably earlier standard submissions.

This leads to another interesting question. What if we filter out all the template solutions and leave in only the predictions that were abnormally good?

Yup, this is certainly more interesting. This definitely looks like the tail of a distribution. It shows that two solutions are clearly better than all the rest, even though the overall differences are incredibly small.

In fact, the difference between the template submission and the best is only 1.5%. But in the financial markets, I bet every little bit counts.

Here’s the dashboard code.
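
For flavor, here’s a minimal sketch of the histogram at the heart of that dashboard (not the actual dashboard code; the leaderboard.csv file and its score column are my assumptions):

```python
import pandas as pd
import plotly.express as px

# Assumed input: a CSV of public leaderboard scores with a "score" column.
scores = pd.read_csv("leaderboard.csv")["score"]

# Discard outliers above 10, then zoom into the competitive band.
competitive = scores[(scores >= 5.4) & (scores <= 5.65)]

fig = px.histogram(x=competitive, nbins=200,
                   title="Public scores, 5.4 to 5.65 (MAE)")
fig.show()
```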

Dive into the data

What more can we learn from this Kaggle contest? Well, let’s think about how it’s structured first. Here are some screenshots of the key datasets: left is the submission file, top right is train/test, and bottom right is revealed targets.

Train/Test (Upper right) – typical market inputs that a trader is aware of (order book and auction data).
revealed_targets – gives you the true outcomes for the previous day, allowing your model to learn from recent history and potentially incorporate past target behaviour into its predictions for the current day.
sample_submission – the prediction file you submit. A sample entry, 26454 480_540_199 0, has three components:
26454 – This is the time_id, a unique identifier for the specific prediction point in the time series. Time IDs help organize predictions chronologically and match them to the correct evaluation data points.
480_540_199 – This is the row_id, a unique identifier that links this prediction to the corresponding row in the test dataset. It appears to be a composite of date_id, seconds_in_bucket, and stock_id (see the parsing sketch after this list).
0 – This is the predicted target value, i.e., what participants are trying to predict: the 60-second future price movement of the stock relative to a synthetic index, measured in basis points (1 basis point = 0.01%).
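
For illustration, here’s a tiny helper that splits a row_id under that assumption (the composite format is inferred, not documented here):

```python
# Hypothetical helper: split a row_id like "480_540_199" into its parts,
# assuming the composite format date_id_seconds_in_bucket_stock_id.
def parse_row_id(row_id: str) -> dict:
    date_id, seconds_in_bucket, stock_id = map(int, row_id.split("_"))
    return {"date_id": date_id,
            "seconds_in_bucket": seconds_in_bucket,
            "stock_id": stock_id}

print(parse_row_id("480_540_199"))
# {'date_id': 480, 'seconds_in_bucket': 540, 'stock_id': 199}
```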

Come again?

Basically, train and test contain current information, while revealed_targets contains historical data. Optiver hopes to use both to build the best model for price movement during the last 10 minutes of a trading day.
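
If it helps, this is roughly how the competition’s time-series API hands you both at inference time (a sketch from memory; treat the exact names as assumptions):

```python
import optiver2023

env = optiver2023.make_env()
for test, revealed_targets, sample_prediction in env.iter_test():
    # test: current-day order book features (no targets)
    # revealed_targets: true targets from the previous day
    sample_prediction["target"] = 0.0  # your model's predictions go here
    env.predict(sample_prediction)
```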

Another dashboard?

Of course! Let’s use it to visualize the input features. Here are some screenshots…

(Note: the deployed dash has a truncated database, so the visualizations won’t match up with the screenshots.)

Volatility distribution

WAP time series

WAP vs Volatility scatter plot

Volatility box plots

Here’s the code.
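
As a hedged sketch of what’s behind these plots (column names follow the competition’s train.csv; the volatility definition is my assumption, not necessarily the dashboard’s):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")

# WAP is provided, but can be recomputed from the order book:
# wap = (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

def realized_vol(wap: pd.Series) -> float:
    """Realized volatility: root sum of squared log WAP returns."""
    log_returns = np.log(wap).diff().dropna()
    return float(np.sqrt((log_returns ** 2).sum()))

# One volatility value per stock per day, feeding the box/scatter plots.
vol = df.groupby(["stock_id", "date_id"])["wap"].apply(realized_vol)
print(vol.describe())
```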

Which solution should we choose?

Well, it’s pretty clear. Use the best one! But what is the difference between the solutions? What makes that 1.5% difference?

Good question!

Let’s have Gemini compare the winning solution, the template baseline solution, and a custom solution written by Claude.

Claude wrote a model?!?!

Yes! It’s actually true. I asked Claude to write a model combining the concepts of the baseline template solution and the 1st-place submission. I was able to get the baseline running on my local system; however, the 1st-place solution was too slow to complete in a reasonable time. So I asked Claude to combine the two, sacrificing some performance for speed.

Solutions code

Gemini model comparison

Those are some big words…

I hear you. I can’t claim to understand all (or any) of them. But the upshot is that the baseline model doesn’t really use the revealed targets, while both the winning and combined solutions do. The winning solution also uses fancier statistics (which explains its longer running time).
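
To make that concrete, here’s a minimal sketch of the kind of revealed-target features described here (the window size and the per-stock history DataFrame are my assumptions):

```python
import pandas as pd

def revealed_target_features(history: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """history: rows of (stock_id, date_id, target) accumulated from
    revealed_targets during inference, sorted by time."""
    grouped = history.groupby("stock_id")["target"]
    return pd.DataFrame({
        # recent mean and volatility of the true targets
        "rt_mean": grouped.transform(lambda s: s.rolling(window).mean()),
        "rt_std": grouped.transform(lambda s: s.rolling(window).std()),
        # crude trend: change in target over the window
        "rt_trend": grouped.transform(lambda s: s.diff(window)),
    })
```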

How does the new solution perform?

Let’s submit and find out!

Does it perform better on the submission?

I’m not sure. I was able to make a late submission, but it doesn’t seem to have been scored.

How does Gemini describe Claude’s model?

“A GPU-accelerated LightGBM ensemble model that predicts short-term price movements by combining extensive feature engineering on order book data with adaptive features derived dynamically from recently revealed target values, trained using robust time-series cross-validation.”
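
In code, that description boils down to something like this (illustrative hyperparameters, not the script’s actual values; device: cuda is the one detail the summary confirms):

```python
import lightgbm as lgb

params = {
    "objective": "regression_l1",  # MAE, matching the competition metric
    "device": "cuda",              # GPU training, per the summary
    "learning_rate": 0.05,         # illustrative values from here down
    "num_leaves": 255,
    "n_estimators": 3000,
}
model = lgb.LGBMRegressor(**params)
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
#           callbacks=[lgb.early_stopping(100)])
```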

More detail2

Any other ways of looking at this?

Maybe graphically? I asked Gemini to make flowcharts for each submission. They’re really complicated, so the details aren’t important. But the simplest one (top left) is the baseline solution; the top right, much more complex, is the enhanced baseline; and the bottom (most complicated) is the winning solution.

This makes intuitive sense.

  1. @misc{optiver-trading-at-the-close,
       author = {Tom Forbes and John Macgillivray and Matteo Pietrobon and Sohier Dane and Maggie Demkin},
       title = {Optiver – Trading at the Close},
       year = {2023},
       howpublished = {\url{https://kaggle.com/competitions/optiver-trading-at-the-close}},
       note = {Kaggle}
     }
  2. High-Level Summary:
    This script implements a sophisticated, GPU-accelerated LightGBM solution for the Optiver “Trading at the Close” competition. Its primary goal is to accurately predict the target variable (representing short-term price movement/volatility) by leveraging both historical market data patterns and, crucially, dynamically incorporating information from recently revealed targets provided during the competition’s inference phase.
    Key Components & Process:
    Core Model: It uses LightGBM, a gradient boosting framework known for its speed and accuracy, configured to run on an NVIDIA GPU (device: cuda) for faster training and inference.
    Feature Engineering (The Engine): This is the most complex part. The script generates a large set (146 in the final version) of predictive features from the raw order book data:
      Market Microstructure: Features capturing the state of the order book, such as bid-ask spreads, volume imbalances, weighted average price (WAP), price pressure, liquidity imbalances, and various pairwise/triplet interactions between different price/size levels.
      Temporal Dynamics: Features looking at how market variables change over short time windows (e.g., price differences, size shifts, returns over 1, 2, 3, and 10 recent time steps).
      Statistical Aggregates: Summary statistics (mean, std) across related price or size columns at a given snapshot.
      Global Stock Behavior: Features characterizing the typical price/size behavior of a specific stock over the entire training period (optional, requires loading).
      Revealed Target Features (Key Enhancement): This is a distinguishing characteristic. It calculates features based on the actual target values revealed from previous time steps during inference (e.g., mean/volatility/trend/weighted mean of recent targets). This allows the model to adapt to very recent market conditions and ground truth.
    Robust Training (When Enabled):
      It uses 5-fold time-series cross-validation with gaps and purging to train models reliably and evaluate performance without data leakage common in time-series data (a minimal sketch of such a split follows this block).
      Early stopping is used within each fold to find the optimal number of boosting rounds and prevent overfitting.
      It trains models for each fold and a final model on all data, saving these artifacts.
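
    A minimal sketch of such a purged split, assuming folds are cut along date_id (fold count and gap size are placeholders):

    ```python
    import numpy as np

    def purged_time_series_folds(date_ids, n_folds=5, gap=5):
        # Expanding-window CV: train strictly before each validation block,
        # purging `gap` days in between to limit leakage.
        unique_days = np.sort(np.unique(date_ids))
        blocks = np.array_split(unique_days, n_folds + 1)
        for k in range(1, n_folds + 1):
            val_days = blocks[k]
            train_idx = np.where(date_ids < val_days.min() - gap)[0]
            val_idx = np.where(np.isin(date_ids, val_days))[0]
            yield train_idx, val_idx
    ```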
    Ensemble Inference:
      For prediction, it loads the pre-trained fold models (and optionally the final model).
      It processes the incoming test data incrementally, updating the revealed target history and generating all features on-the-fly.
      It generates a final prediction by averaging the outputs of the loaded models (an ensemble approach).
      It applies post-processing steps like zero-sum adjustment (to potentially align with portfolio constraints) and clipping predictions to plausible bounds; a sketch of this step closes the footnote.
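
    A sketch of that final step (the clip bound is an illustrative placeholder):

    ```python
    import numpy as np

    def postprocess(preds: np.ndarray, bound: float = 300.0) -> np.ndarray:
        # Zero-sum adjustment: center predictions across stocks, mirroring
        # the index-relative definition of the target.
        adjusted = preds - preds.mean()
        # Clip to plausible bounds (placeholder value, in basis points).
        return np.clip(adjusted, -bound, bound)
    ```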