This is the second part in a two-part series. For part one, click here.

Before I continue, I’d like to point out that this is not a tutorial on machine learning or Scikit-Learn. If you would like to read more about those topics, I suggest selecting a link or two from the Resources dropdown in the top right.

Now we get to the good stuff: the code. Or, at least, a summary of the code. There is too much to go through line by line, so I’ll just point out the significant parts. Open up the code here and follow along.

Now, I had never done one of these Kaggle competitions, so at first I wasn’t sure what to do. I spent some time browsing early submissions and tutorials to find a good approach. I looked at everything from simple predictions based on point differential, to basic stats such as shooting and turnover percentage, to some serious moneyball-type stuff. I read about people such as Jeff Sagarin and Dean Oliver, and terms such as the Four Factors, Elo Rating, and Offensive Efficiency popped up in several articles.

After playing around and experimenting with a few methods, I eventually decided to use the same metrics I used in my previous project, the NCAA Stats Webscraper. I decided on the following metrics (which became the feature set for the machine learning model):

Using this feature set, my model did a horrible job of predicting upsets (it called only a couple) and consequently performed poorly. I should have known better: a March Madness tournament with very few upsets is just not that likely (you can always count on several), and this past tournament had a lot of them. Actually, my overall rank in the competition wasn’t that bad; I placed 498 out of 934 teams. However, I used the same submission to fill out my annual office bracket pool, and that failed with flying colors.

After the submission deadline, while the tournament was playing out, I did a little reading and continued to tinker with my model. After some research I added four new features.

Adding these four features made for a much better-performing model, one that predicted many of the actual upsets.

The first part of the code builds a vector containing all the features for each team in every season from 2003 to 2018:

Season TeamID Seed OrdinalRank PIE eFGP ToR ORP FTR 4Factor AdjO AdjD AdjEM DRP ORTM AR FTP PtsDf WinPCT
0 2003 1328 1 3 0.604965 0.512124 0.155078 0.347284 0.332030 0.362880 112.1 89.1 23.01 0.709854 0.333333 0.153606 0.714351 11.000000 0.800000
1 2003 1448 2 7 0.637398 0.511972 0.178941 0.429724 0.472499 0.406344 114.5 94.6 19.87 0.687237 -0.344828 0.147679 0.755330 10.793103 0.827586
2 2003 1393 3 9 0.604162 0.515151 0.160408 0.385242 0.393873 0.382292 114.4 91.1 23.28 0.630790 0.689655 0.146059 0.687824 10.206897 0.827586
3 2003 1257 4 11 0.633058 0.528861 0.156505 0.356053 0.418922 0.384719 115.8 93.0 22.75 0.664037 -0.166667 0.162239 0.690997 13.366667 0.800000
4 2003 1280 5 24 0.643001 0.521359 0.197467 0.383446 0.323860 0.383178 107.4 85.8 21.53 0.695785 -2.966667 0.154033 0.666941 10.000000 0.700000

Next, the code isolates the winning and losing teams and creates two new datasets, each with an added result column: one holds the difference of the feature vectors, winners minus losers, with a result of “1”; the other holds losers minus winners, with a result of “0”.

Finally, concatenate the winners and losers on top of each other and sort by season:

Seed OrdinalRank PIE eFGP ToR ORP FTR 4Factor AdjO AdjD AdjEM DRP ORTM AR FTP PtsDf WinPCT result Season WTeamID LTeamID
0 0 -31 -0.107021 -0.013235 0.012414 -0.012949 -0.152277 -0.027622 2.9 4.8 -1.90 -0.056104 -1.864368 -0.008846 0.152397 -9.208046 -0.151724 1 2003 1421 1411
28 11 46 -0.013858 -0.014261 0.011557 -0.052423 0.032148 -0.008478 -8.9 4.3 -13.24 0.027260 -3.586207 -0.002358 0.100614 -1.793103 -0.034483 0 2003 1393 1264
27 -1 9 0.082039 0.067915 0.003750 0.025868 -0.113316 0.016280 4.0 3.2 0.73 -0.004268 0.170507 0.039910 -0.103694 3.822581 0.034562 0 2003 1345 1261
26 13 95 -0.154728 -0.025687 -0.007815 0.001559 -0.096460 -0.026386 -10.1 19.7 -29.77 -0.032008 0.832258 -0.018119 0.083997 -11.669892 -0.189247 0 2003 1338 1447
25 5 37 0.032360 0.069025 -0.003349 -0.055668 -0.040919 0.009501 4.4 8.7 -4.32 0.091727 -1.390805 0.037968 0.036850 4.260536 0.125160 0 2003 1329 1335

What we end up with is a dataframe with 1,962 rows and 21 columns. The columns are the 17 features, the result column, and three extra columns for the Season, Winning Team ID, and Losing Team ID. Each row is a simulated “matchup” between tournament teams.
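To make that step concrete, here is a minimal sketch of how the winner/loser differencing and concatenation might look in pandas. The variable names (team_features for the per-team table above, tourney_results for the tournament results) and the team_vector() helper are illustrative assumptions on my part, not the original code.

import pandas as pd

# Assumed inputs (illustrative names, not the original variables):
#   team_features   - the per-team table shown above (Season, TeamID, Seed, ..., WinPCT)
#   tourney_results - tournament game results with Season, WTeamID, LTeamID
feature_cols = [c for c in team_features.columns if c not in ('Season', 'TeamID')]

def team_vector(season, team_id):
    """Return the feature vector for one team in one season."""
    row = team_features[(team_features['Season'] == season) &
                        (team_features['TeamID'] == team_id)]
    return row[feature_cols].iloc[0]

win_rows, lose_rows = [], []
for _, game in tourney_results.iterrows():
    w = team_vector(game['Season'], game['WTeamID'])
    l = team_vector(game['Season'], game['LTeamID'])
    ids = {'Season': game['Season'], 'WTeamID': game['WTeamID'], 'LTeamID': game['LTeamID']}
    win_rows.append({**(w - l).to_dict(), 'result': 1, **ids})   # winners minus losers -> 1
    lose_rows.append({**(l - w).to_dict(), 'result': 0, **ids})  # losers minus winners -> 0

# Stack the two datasets on top of each other and sort by season
prediction_dataset = pd.concat([pd.DataFrame(win_rows), pd.DataFrame(lose_rows)]).sort_values('Season')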

The second part covers the training, evaluation, and testing of the machine learning models, as well as the creation of the submission file and a more readable file that can be used to create an unbeatable bracket for your office pool.

The first step is separating the dataset into training and test data:

from sklearn.model_selection import train_test_split

# Separate the target (result), the feature matrix, and the ID columns
y = prediction_dataset['result']
X = prediction_dataset.loc[:, :'WinPCT']          # Seed through WinPCT
train_IDs = prediction_dataset.loc[:, 'Season':]  # Season, WTeamID, LTeamID

# 70/30 train/test split, stratified on the result column
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=1, stratify=y)

I chose to go with a 70/30 split of training to test data, which seemed to work well considering the amount of data I had.

The next step is to initialize the six different classifiers and evaluate them using grid search cross-validation. Without going into too much detail, the Logistic Regression classifier performed the best, returning a test accuracy of 79.80%. I didn’t realize this at the time, but it turns out I made a significant error in evaluating the classifiers: I used the training set in the grid search cross-validation. What I should have done was have a validation set in addition to the training and test sets. (Not so much of an independent evaluation, is it?)
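As a rough sketch of what that grid-search step might look like for the Logistic Regression classifier (the parameter grid and the scaling pipeline below are my own illustrative assumptions, not the original setup):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale inside a pipeline so the scaler is fit only on each cross-validation training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

# Illustrative regularization grid
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)

print('Best cross-validation accuracy: %.4f' % grid.best_score_)
print('Held-out test accuracy:         %.4f' % grid.score(X_test, y_test))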

Anyway, after fitting the model, I created a prediction dataset to simulate every possible matchup of teams in the 2018 March Madness tournament:

ID Pred
0 2018_1104_1112 0.438796
1 2018_1104_1113 0.352291
2 2018_1104_1116 0.338818
3 2018_1104_1120 0.362255
4 2018_1104_1137 0.719037

The three numbers in the ID column are the year of the tournament, the ID of team #1, and the ID of team #2. The number in the Pred column is the predicted probability that team #1 beats team #2. This format is required by Kaggle, but it doesn’t really help if you want to fill out a bracket.
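For reference, generating a submission like this could look roughly like the sketch below; the teams_2018 list, the fitted model object, and the team_vector() helper from the earlier sketch are assumptions on my part.

from itertools import combinations
import pandas as pd

# teams_2018: assumed list of the TeamIDs in the 2018 tournament field
# model:      the fitted classifier; team_vector() is the helper sketched earlier
rows = []
for team_a, team_b in combinations(sorted(teams_2018), 2):
    diff = team_vector(2018, team_a) - team_vector(2018, team_b)
    prob = model.predict_proba(diff.values.reshape(1, -1))[0, 1]  # P(team #1 beats team #2)
    rows.append({'ID': '2018_%d_%d' % (team_a, team_b), 'Pred': prob})

submission = pd.DataFrame(rows)
submission.to_csv('submission.csv', index=False)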

The final piece of the code matches the team IDs with team names and generates readable predictions, such as “Alabama beats Arizona: 0.563389”. This doesn’t mean Alabama will necessarily play Arizona, but if they did, there would be a 56% probability that Alabama wins.
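As a sketch, that lookup could be as simple as joining the IDs against Kaggle’s Teams.csv file (which maps TeamID to TeamName); the exact file name, the column names, and the submission dataframe below are assumptions rather than the original code.

import pandas as pd

teams = pd.read_csv('Teams.csv')  # assumed Kaggle file with TeamID and TeamName columns
id_to_name = dict(zip(teams['TeamID'], teams['TeamName']))

readable = []
for _, row in submission.iterrows():
    season, team_a, team_b = row['ID'].split('_')
    readable.append('%s beats %s: %f' % (id_to_name[int(team_a)],
                                         id_to_name[int(team_b)],
                                         row['Pred']))

print(readable[0])  # e.g. "Alabama beats Arizona: 0.563389"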

So that’s it, my first Kaggle competition. I have to admit I jumped into this knowing very little about supervised learning, Scikit-Learn, and classifier models, and I learned most of it on the fly. It was a good experience, and I will definitely do it again next year. Maybe I will try different features and different classifiers. Who knows, maybe I’ll try some deep learning models. One thing I do know: I won’t use training data in my cross-validation!

Don’t mind me, I’m just rambling.