My First Kaggle Competition (Part One)

Sorry about the delay in getting this post out. It was a combination of a little procrastination and being distracted by Azure. (More on the second one later.) Before I get into all the wonderful details about my submission to the Kaggle competition and the code behind it, here is a little background.

I’ve been a fan of college basketball for quite a while now. I will watch just about any game, although I root for Texas. There is just something about college basketball that appeals to me that the NBA lacks. Maybe it’s because the athletes haven’t become rich prima donnas yet. Anyway, college basketball’s postseason is even better than any game during the regular season. In fact, that first weekend of March Madness may just be the most exciting playoff experience of any sport. (Although for the record, game 7 of the Stanley Cup Finals is the top playoff moment, in my opinion.)

In addition to watching the tournament, I love filling out a bracket and competing in my office pool. I won my first year. My secret strategy: I copied Joe Lunardi’s bracket. (Don’t tell him.) The next year his bracket wasn’t available, so I turned to Google. After some digging I stumbled upon the concept of the Four Factors. I cannot remember the name of the blog post I read or the author, but the general idea was to pick the winner of each tournament game by comparing the two teams’ Four Factors stats. The author said that way, when you make the wrong pick, you can blame the system. I like that. If there is one thing that frustrates me to no end, it’s when I go back and forth on a matchup only to change my pick at the last minute, while still thinking the other team is the better pick. Then of course that other team wins, and I spend the rest of the tournament beating myself up. Yes, I’d rather have a system that I can feed data to and that tells me which team to pick.

So I compared each team’s effective field goal percentage, turnover rate, offensive rebounding percentage, and free throw rate (plus the corresponding defensive stats), along with adjusted efficiency from Ken Pomeroy and strength of schedule, to determine a winner. That’s ten stats per team, so I had to adjust the weights to ensure there wasn’t a tie. I had my system, but I still had to get the data. I would go to my favorite sports stats website, teamrankings.com, and manually enter the numbers into my spreadsheet. It would take hours. And because it took so long, I needed to get started before the end of the regular season. But the stats were updated after each game, so I had to go back and manually grab the updated numbers. Very tedious. There had to be a better way, right?
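To make the idea concrete, here is a minimal Python sketch of that kind of weighted stat comparison. It is only an illustration: the stat names, the weights, and the sample numbers are all made up for this example and are not the values from the actual spreadsheet. One assumption worth calling out is that the weights are whole numbers adding up to an odd total, which is one simple way to rule out a tie as long as no single stat is exactly equal between the two teams.

```python
# Hypothetical stat names and weights, chosen only for illustration.
# Each entry maps a stat to (weight, higher_is_better).
# The weights sum to 21 (odd), so the points cannot split evenly
# unless some stat is exactly equal and gets skipped.
STATS = {
    "efg_pct": (3, True),          # effective field goal percentage
    "tov_rate": (2, False),        # turnover rate (lower is better)
    "orb_pct": (2, True),          # offensive rebounding percentage
    "ft_rate": (1, True),          # free throw rate
    "opp_efg_pct": (3, False),     # defensive versions of the same four
    "opp_tov_rate": (2, True),
    "opp_orb_pct": (2, False),
    "opp_ft_rate": (1, False),
    "adj_efficiency": (4, True),   # adjusted efficiency
    "sos": (1, True),              # strength of schedule
}


def compare_teams(team_a, team_b, stats=STATS):
    """Return (points_a, points_b): each stat's weight goes to the better team."""
    points_a = points_b = 0
    for name, (weight, higher_is_better) in stats.items():
        a, b = team_a[name], team_b[name]
        if a == b:
            continue  # nobody earns points for an exactly equal stat
        if (a > b) == higher_is_better:
            points_a += weight
        else:
            points_b += weight
    return points_a, points_b


def pick_winner(team_a, team_b, stats=STATS):
    """Return the name of the team with more points, or None on a tie."""
    points_a, points_b = compare_teams(team_a, team_b, stats)
    if points_a == points_b:
        return None
    return team_a["name"] if points_a > points_b else team_b["name"]


# Made-up numbers for two fictional teams.
team_a = {"name": "Team A", "efg_pct": 0.55, "tov_rate": 0.16, "orb_pct": 0.33,
          "ft_rate": 0.38, "opp_efg_pct": 0.47, "opp_tov_rate": 0.20,
          "opp_orb_pct": 0.27, "opp_ft_rate": 0.30, "adj_efficiency": 25.0,
          "sos": 8.0}
team_b = {"name": "Team B", "efg_pct": 0.52, "tov_rate": 0.18, "orb_pct": 0.35,
          "ft_rate": 0.36, "opp_efg_pct": 0.45, "opp_tov_rate": 0.19,
          "opp_orb_pct": 0.29, "opp_ft_rate": 0.31, "adj_efficiency": 20.0,
          "sos": 9.0}

print(pick_winner(team_a, team_b))
```

Giving each stat's full weight to whichever team is better keeps the pick easy to explain, at the cost of ignoring how big the gap is. Scoring by the size of the difference would be another reasonable choice.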

Enter Python. I won’t go into the details in this post, but I wrote a web scraping application to automate the task of grabbing the data and entering it into my spreadsheet. Check out my projects page for more on this project. No sooner had I finished it than I discovered machine learning. A better system! Instead of some arbitrary comparison of stats, how about predicting the winner based on previous results? Perfect!

It didn’t take long to discover Kaggle. A Google search for “machine learning predict NCAA March Madness” pretty much led me to it, since Kaggle has been running machine learning competitions on the NCAA March Madness tournament for the past several years. The concept is pretty simple: they provide the rules and some data, and you (or you and your team) provide the submission. You can use whatever tools you prefer, and you can also use your own data. Check out Kaggle if you haven’t already. They do more than machine learning competitions. They have thousands of datasets freely available, kernels in the form of Python or R notebooks, tutorials, and even job postings.

I realize that I’ve been rambling on for a while and I haven’t even gotten to any code yet. And that’s what everyone cares about, right? So instead of writing one longer post, I’ll split this into two parts. As a reader I do not care for really long posts, so it wouldn’t be right for me to impose them on my readers. So, coming up in a future post: my code. Or, if you made your way to my projects page like I suggested earlier, you may have already found it on my GitHub. Thanks for reading!
