
How Height Has Affected Win % in 1980 and 2020

by Vivek Mantha


Introduction

Professional basketball has changed greatly over the years. Basketball has long been seen as a game for tall people. However, the introduction of the 3-point line and rules that make the game less physical have made it easier for smaller players to excel. Because 3-pointers are worth more than 2-pointers, teams shoot more 3-pointers and draft more 3-point shooters. A player’s height becomes less relevant if they are a good 3-point shooter. The 2016 MVP, Steph Curry, is only 6 foot 2, but his ability to shoot the 3 and the less physical style of the league allowed him to excel despite his height. In 1980, a player of his size would have had more trouble scoring because the league was more physical and the game was played closer to the basket. The objective of my project is to illustrate the difference between the effect of height on win percentage in 1980 and in 2020. Has ‘small ball’ really taken over the NBA?


Scraping and Processing the Data

The first step is to get the average height of each NBA team along with its win percentage. I used the requests ‘get’ function with the url of the page that contains the table with my desired data. The first argument to requests.get is that url, while the headers argument contains my request headers. Some websites reject requests that do not carry a browser-like user agent, which is why I put my Mac user agent in my headers dictionary. The response from requests.get contains the page's html. Beautiful Soup is then used to parse that html: its ‘find’ function, called with the ‘table’ keyword, locates the table, and ‘prettify’ turns the result into a cleanly formatted html string. I then use the pandas read_html function to convert this table into a pandas data frame. This process is repeated for every single team in 2020 and in 1980. Unfortunately, the data on team heights are in separate tables, so I had to make a request for every team. This is not best practice, and I would not have done it this way if all my data had been in one table. Here are some helpful links:
Pandas Scraping
Pandas Array Slicing
Numpy Documentation
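The scraping steps described above could look roughly like the sketch below. It is only an illustration, not the exact code used for this project: the user agent string and the helper names table_from_html and fetch_roster are assumptions, and the url for each team page has to be supplied by the caller.

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumption: any browser-like User-Agent string; the original headers are not shown.
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"}


def table_from_html(html: str) -> pd.DataFrame:
    """Parse the first <table> in an html document into a data frame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # first <table> element, or None if absent
    if table is None:
        raise ValueError("no <table> found in the page")
    # prettify() returns the element as a formatted html string that
    # read_html can parse; StringIO wraps it in a file-like object.
    return pd.read_html(StringIO(table.prettify()))[0]


def fetch_roster(url: str) -> pd.DataFrame:
    """Download one team page and return its first table (hypothetical helper)."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # stop early on a 4xx/5xx response
    return table_from_html(response.text)
```

Keeping the parsing step separate from the download makes it possible to try the parser on a saved page before sending one request per team.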
Once I have my data in a pandas data frame, I can access the heights of any specific team by the column name ‘Ht’. For each team, I convert the heights into inches with a function that returns an array of player heights in inches. I then use the numpy mean function to get the mean of that array, which gives me my x value: the average height for that team. The y value is the team's win percentage, which I entered manually from Basketball Reference since there is no need to scrape it from a table. For every team, I append its average height to an array of x values and its corresponding win percentage to an array of y values. I now have the average height of each team in 1980 and in 2020, along with a corresponding array of win percentages for each year.
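As a minimal sketch of the height conversion, assuming the ‘Ht’ column stores heights as feet-inches strings such as "6-2" (the format is an assumption, and heights_to_inches is a hypothetical helper name), the x and y arrays can be built like this. The roster and the win percentage below are made up for illustration.

```python
import numpy as np
import pandas as pd


def heights_to_inches(heights) -> np.ndarray:
    """Convert 'feet-inches' strings such as '6-2' to total inches."""
    inches = []
    for h in heights:
        feet, inch = str(h).split("-")
        inches.append(int(feet) * 12 + int(inch))
    return np.array(inches)


# Made-up roster for illustration only, not real team data.
roster = pd.DataFrame({"Ht": ["6-2", "6-7", "6-10", "7-0"]})

x_values, y_values = [], []
x_values.append(np.mean(heights_to_inches(roster["Ht"])))  # average height in inches
y_values.append(0.500)  # placeholder win percentage, entered by hand
```

For this made-up roster the heights are 74, 79, 82 and 84 inches, so the average appended to x_values is 79.75.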


The 1980 cross-val score for both models was higher than the 2020 cross-val score for both models. The Logistic Regression model had a better cross-val score for 1980 than the LDA model did, while the 2020 cross-val score was the same for both models. The fact that the Logistic Regression model's cross-val score for 1980 is about 15 percentage points higher than its score for 2020 leads me to believe that the relationship between height and win percentage was stronger in 1980 than in 2020, which supports my hypothesis. Unfortunately, neither model had a high cross-val score in either year, apart from the Logistic Regression model's score of .678 for 1980. This suggests that height on its own is not a strong predictor of win percentage. Nevertheless, using the models to predict the win percentages of small and big teams supported my hypothesis. Both 1980 models predicted that small teams would have a win % below 45% and that big teams would have a win % above 45%. Both 2020 models predicted the exact opposite: that small teams would have a win % above 45% and that big teams would have a win % below 45%.
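To make the comparison above concrete, here is a rough sketch of how the two classifiers could be scored with sklearn, assuming LDA refers to linear discriminant analysis and that the label is whether a team's win % is above 45%. The data are synthetic and deterministic, made up purely for illustration; the 5 folds and the variable names are also assumptions, so the scores this produces say nothing about the real 1980 or 2020 seasons.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only: 30 made-up teams, not real NBA figures.
heights = np.linspace(77.5, 80.5, 30)          # average team height in inches
wobble = 0.08 * np.sin(np.arange(30) * 2.0)    # deterministic stand-in for noise
win_pct = 0.45 + 0.10 * (heights - 79.0) + wobble

X = heights.reshape(-1, 1)                     # sklearn expects a 2-D feature array
y = (win_pct > 0.45).astype(int)               # 1 = win % above 45%, 0 = otherwise

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}

# Class predictions for a hypothetical small team and a hypothetical big team.
small_big = np.array([[77.5], [80.5]])
predictions = {name: m.fit(X, y).predict(small_big) for name, m in models.items()}
```

Running the same steps once on the 1980 arrays and once on the 2020 arrays would give one cross-val score per model per year, which is the comparison discussed above.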





Conclusion


All you need to do to make sense of data is read up on Python libraries like pandas, numpy, and sklearn, read up on the different types of regression analysis, and follow this tutorial. Reflecting on my initial hypothesis, the Linear Regression line showed a positive correlation between height and win % in 1980 and a negative one in 2020. The Logistic Regression analysis illustrated that height defined a team's success more in 1980 than in 2020, and it showed the same correlations as the Linear Regression model. The cross-val score of .52 for the 2020 Logistic Regression model leads me to believe that the correlation between height and win % in 2020 is too small to fully accept my hypothesis. Thus, I accept two of the three parts of my initial hypothesis: that height is positively correlated with win % in 1980, and that the correlation between height and win % is stronger in 1980 than in 2020. In 2020, there was not as much of a correlation between height and win percentage because of how the game has changed. Jump shots and 3-pointers have taken over the game, whereas in 1980 more players tried to score closer to the basket. Taller players tend to be better at scoring close to the basket than smaller players. This is why height seems to have had a stronger effect on win percentage in 1980 than in 2020. The 2020 Logistic Regression model had a lower score than the 1980 model because in 2020 players of all heights can shoot jump shots and 3-pointers, which makes height less relevant. All in all, it seems that 'small ball' hasn't exactly taken over, but 'tall ball' has diminished.