Sports Analytics and Data Science

Winning the Game with Methods and Models

THOMAS W. MILLER

Publisher: Paul Boger

Editor-in-Chief: Amy Neidlinger

Executive Editor: Jeanne Glasser Levine

Cover Designer: Alan Clements

Managing Editor: Kristy Hart

Project Editor: Andy Beaster

Manufacturing Buyer: Dan Uhrig

Old Tappan, New Jersey 07675

For information about buying this title in bulk quantities, or for special sales opportunities

(which may include electronic versions; custom cover designs; and content particular

to your business, training goals, marketing focus, or branding interests), please contact

our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the U.S., please contact international@pearsoned.com.

Company and product names mentioned herein are the trademarks or registered

trademarks of their respective owners.

means, without permission in writing from the publisher.

Printed in the United States of America

First Printing November 2015

ISBN-10: 0-13-388643-3

ISBN-13: 978-0-13-388643-6

Pearson Education LTD.

Pearson Education Australia PTY, Limited.

Pearson Education Singapore, Pte. Ltd.

Pearson Education Asia, Ltd.

Pearson Education Canada, Ltd.

Pearson Educación de Mexico, S.A. de C.V.

Pearson Education—Japan

Pearson Education Malaysia, Pte. Ltd.

Library of Congress Control Number: 2015954509

Contents

Preface v

Figures ix

Tables xi

Exhibits xiii

1 Understanding Sports Markets 1

2 Assessing Players 23

3 Ranking Teams 37

4 Predicting Scores 49

5 Making Game-Day Decisions 61

6 Crafting a Message 69

7 Promoting Brands and Products 101

8 Growing Revenues 119

9 Managing Finances 133

10 Playing What-if Games 147

11 Working with Sports Data 169

12 Competing on Analytics 193

A Data Science Methods 197

A.1 Mathematical Programming 200

A.2 Classical and Bayesian Statistics 203

A.3 Regression and Classification 206

A.4 Data Mining and Machine Learning 215

A.5 Text and Sentiment Analysis 217

A.6 Time Series, Sales Forecasting, and Market Response Models 226

A.7 Social Network Analysis 230

A.8 Data Visualization 234

A.9 Data Science: The Eclectic Discipline 240

B Professional Leagues and Teams 255

Data Science Glossary 261

Baseball Glossary 279

Bibliography 299

Index 329

Preface

“Sometimes you win, sometimes you lose, sometimes it rains.”

—TIM ROBBINS AS EBBY CALVIN LALOOSH IN Bull Durham (1988)

Businesses attract customers, politicians persuade voters, websites cajole visitors, and sports teams draw fans. Whatever the goal or target, data and models rule the day.

This book is about building winning teams and successful sports businesses. Winning and success are more likely when decisions are guided by data and models. Sports analytics is a source of competitive advantage.

This book provides an accessible guide to sports analytics. It is written for anyone who needs to know about sports analytics, including players, managers, owners, and fans. It is also a resource for analysts, data scientists, and programmers. The book views sports analytics in the context of data science, a discipline that blends business savvy, information technology, and modeling techniques.

To use analytics effectively in sports, we must first understand sports— the industry, the business, and what happens on the fields and courts of play. We need to know how to work with data—identifying data sources, gathering data, organizing and preparing them for analysis. We also need to know how to build models from data. Data do not speak for themselves. Useful predictions do not arise out of thin air. It is our job to learn from data and build models that work.

The best way to learn about sports analytics and data science is through examples. We provide a ready resource and reference guide for modeling techniques. We show programmers how to solve real world problems by building on a foundation of trustworthy methods and code.

The truth about what we do is in the programs we write. The code is there for everyone to see and for some to debug. Data sets and computer programs are available from the website for the Modeling Techniques series at http://www.ftpress.com/miller/. There is also a GitHub site at https://github.com/mtpa/.

When working on sports problems, some things are more easily accomplished with R, others with Python. And there are times when it is good to offer solutions in both languages, checking one against the other.

One of the things that distinguishes this book from others in the area of sports analytics is the range of data sources and topics discussed. Many researchers focus on numerical performance data for teams and players. We take a broader view of sports analytics: the view of data science. There are text data as well as numeric data. And with the growth of the World Wide Web, the sources of data are plentiful. Much can be learned from public domain sources through crawling and scraping the web and utilizing application programming interfaces (APIs).

I learn from my consulting work with professional sports organizations. Research Publishers LLC with its ToutBay division promotes what can be called “data science as a service.” Academic research and models can take us only so far. Eventually, to make a difference, we need to implement our ideas and models, sharing them with one another.

Many have influenced my intellectual development over the years. There were those good thinkers and good people, teachers and mentors for whom I will be forever grateful. Sadly, no longer with us are Gerald Hahn Hinkle in philosophy and Allan Lake Rice in languages at Ursinus College, and Herbert Feigl in philosophy at the University of Minnesota. I am also most thankful to David J. Weiss in psychometrics at the University of Minnesota and Kelly Eakin in economics, formerly at the University of Oregon.

My academic home is the Northwestern University School of Professional Studies. Courses in sports research methods and quantitative analysis, marketing analytics, database systems and data preparation, web and network data science, web information retrieval and real-time analytics, and data visualization provide inspiration for this book. Thanks to the many students and fellow faculty from whom I have learned. And thanks to colleagues and staff who administer excellent graduate programs, including the Master of Science in Predictive Analytics, Master of Arts in Sports Administration, Master of Science in Information Systems, and the Advanced Certificate in Data Science.

Lorena Martin reviewed this book and provided valuable feedback while she authored a companion volume on sports performance measurement and analytics (Martin 2016). Adam Grossman and Tom Robinson provided valuable feedback about coverage of topics in sports business management. Roy Sanford provided advice on statistics. Amy Hendrickson of TEXnology Inc. applied her craft, making words, tables, and figures look beautiful in print—another victory for open source. Candice Bradley served dual roles as a reviewer and copyeditor for all books in the Modeling Techniques series. And Andy Beaster helped in preparing this book for final production. I am grateful for their guidance and encouragement.

Thanks go to my editor, Jeanne Glasser Levine, and publisher, Pearson/FT Press, for making this book possible. Any writing issues, errors, or items of unfinished business, of course, are my responsibility alone.

My good friend Brittney and her daughter Janiya keep me company when time permits. And my son Daniel is there for me in good times and bad, a friend for life. My greatest debt is to them because they believe in me.

Thomas W. Miller Glendale, California October 2015

Figures

1.1 MLB, NBA, and NFL Average Annual Salaries 10
1.2 MLB Team Payrolls and Win/Loss Performance (2014 Season) 11
1.3 A Perceptual Map of Seven Sports 13
2.1 Multitrait-Multimethod Matrix for Baseball Measures 25
3.1 Assessing Team Strength: NBA Regular Season (2014–2015) 40
4.1 Work of Data Science 50
4.2 Data and Models for Research 52
4.3 Training-and-Test Regimen for Model Evaluation 54
4.4 Training-and-Test Using Multi-fold Cross-validation 56
4.5 Training-and-Test with Bootstrap Resampling 57
4.6 Predictive Modeling Framework for Team Sports 59
6.1 How Sports Fit into the Entertainment Space (Or Not) 72
6.2 Indices of Dissimilarity Between Pairs of Binary Variables 73
6.3 Consumer Preferences for Dodger Stadium Seating 77
6.4 Choice Item for Assessing Willingness to Pay for Tickets 79
6.5 The Market: A Meeting Place for Buyers and Sellers 80
7.1 Dodgers Attendance by Day of Week 104
7.2 Dodgers Attendance by Month 104
7.3 Dodgers Weather, Fireworks, and Attendance 106
7.4 Dodgers Attendance by Visiting Team 107
7.5 Regression Model Performance: Bobbleheads and Attendance 108
8.1 Competitive Analysis for an NBA Team: Golden State Warriors 129
9.1 Cost-Volume-Profit Analysis 135
9.2 Higher Profits Through Increased Sales 136
9.3 Higher Profits Through Lower Fixed Costs 137
9.4 Higher Profits Through Increased Efficiency 137
9.5 Decision Analysis: Investing in a Sports Franchise (Or Not) 143
10.1 Game-day Simulation (Offense Only) 152
10.2 Mets’ Away and Yankees’ Home Data (Offense and Defense) 154
10.3 Balanced Game-day Simulation (Offense and Defense) 155
10.4 Actual and Theoretical Runs-scored Distributions 157
10.5 Poisson Model for Mets vs. Yankees at Yankee Stadium 159
10.6 Negative Binomial Model for Mets vs. Yankees at Yankee Stadium 160
10.7 Probability of Home Team Winning (Negative Binomial Model) 162
10.8 Strategic Modeling Techniques in Sports 164
11.1 Software Stack for a Document Search and Selection System 173
11.2 The Information Supply Chain of Professional Team Sports 174
11.3 Automated Data Acquisition by Crawling, Scraping, and Parsing 177
11.4 Automated Data Acquisition with an API 179
11.5 Gathering and Organizing Data for Analysis 180
A.1 Mathematical Programming Modeling Methods 201
A.2 Evaluating the Predictive Accuracy of a Binary Classifier 212
A.3 Linguistic Foundations of Text Analytics 218
A.4 Creating a Terms-by-Documents Matrix 221
A.5 Data and Plots for the Anscombe Quartet 235
A.6 Visualizing Many Games Across a Season: Differential Runs Plot 236
A.7 Moving Fraction Plot for Basketball 237
A.8 Visualizing Basketball Play-by-Play Data 239
A.9 Data Science: The Eclectic Discipline 241

Tables

1.1 Sports and Recreation Activities in the United States 3
1.2 MLB Team Valuation and Finances (March 2015) 5
1.3 NBA Team Valuation and Finances (January 2015) 6
1.4 NFL Team Valuation and Finances (August 2014) 7
1.5 World Soccer Team Valuation and Finances (May 2015) 8
2.1 Levels of Measurement 29
3.1 NBA Team Records (2014–2015 Season) 39
5.1 Twenty-five States of a Baseball Half-Inning 63
6.1 Dissimilarity Matrix for Entertainment Events and Activities 71
6.2 Consumer Preference Data for Dodger Stadium Seating 76
7.1 Bobbleheads and Dodger Dogs 103
7.2 Regression of Attendance on Month, Day of Week, and Promotion 110
9.1 Discounted Cash Flow Analysis of a Player Contract 139
9.2 Would you like to buy the Brooklyn Nets? 141
10.1 New York Mets’ Early Season Games in 2007 149
10.2 New York Yankees’ Early Season Games in 2007 150
A.1 Three Generalized Linear Models 209
A.2 Social Network Data: MLB Player Transactions 233
B.1 Women’s National Basketball Association (WNBA) 255
B.2 Major League Baseball (MLB) 256
B.3 Major League Soccer (MLS) 257
B.4 National Basketball Association (NBA) 258
B.5 National Football League (NFL) 259
B.6 National Hockey League (NHL) 260

Exhibits

1.1 MLB, NBA, and NFL Player Salaries (R) 16
1.2 Payroll and Performance in Major League Baseball (R) 18
1.3 Making a Perceptual Map of Sports (R) 19
3.1 Assessing Team Strength by Unidimensional Scaling (R) 43
6.1 Mapping Entertainment Events and Activities (R) 83
6.2 Mapping Entertainment Events and Activities (Python) 86
6.3 Preferences for Sporting Events—Conjoint Analysis (R) 88
6.4 Preferences for Sporting Events—Conjoint Analysis (Python) 99
7.1 Shaking Our Bobbleheads Yes and No (R) 113
7.2 Shaking Our Bobbleheads Yes and No (Python) 116
10.1 Team Winning Probabilities by Simulation (R) 167
10.2 Team Winning Probabilities by Simulation (Python) 168
11.1 Simple One-Site Web Crawler and Scraper (Python) 186
11.2 Gathering Opinion Data from Twitter: Football Injuries (Python) 189
A.1 Programming the Anscombe Quartet (Python) 242
A.2 Programming the Anscombe Quartet (R) 244
A.3 Making Differential Runs Plots for Baseball (R) 245
A.4 Moving Fraction Plot: A Basketball Example (R) 246
A.5 Visualizing Basketball Games (R) 248
A.6 Seeing Data Science as an Eclectic Discipline (R) 252

1 Understanding Sports Markets

“Those of you on the floor at the end of the game, I’m proud of you. You played your guts out. I’m only going to say this one time. All of you have the weekend. Think about whether or not you want to be on this team under the following condition: What I say when it comes to this basketball team is the law, absolutely and without discussion.”

—GENE HACKMAN AS COACH NORMAN DALE IN Hoosiers (1986)

In applying the laws of economics to professional sports, we must consider the nature of sports and the motives of owners. Professional sports are different from other forms of business.

There are sellers and buyers of sports entertainment. The sellers are the players and teams within the leagues of professional sports. The buyers are consumers of sports, many of whom never go to games in person but who watch sports on television, listen to the radio, and buy sports team paraphernalia.

Sports compete with other forms of entertainment for people’s time and money. And various sports compete with one another, especially when their seasons overlap. Sports teams produce entertainment content that is distributed through the media. Sports teams license their brand names and logos to other organizations, including sports apparel manufacturers.

Sports teams are not independent businesses competing with one another. While players and teams compete on the fields and courts of play, they cooperate with one another as members of leagues. The core product of sports is the sporting contest, a joint product of two or more players or two or more teams.

Fifty-four sports and recreation activities, shown in table 1.1, are tracked by the National Sporting Goods Association (2015), which serves the sporting goods industry. In recent years, participation in baseball, basketball, football, and tennis has declined, while participation in soccer has increased. There has been growth in individual recreational sports, such as skateboarding and snowboarding. Of course, levels of participation in sports are not necessarily an indicator of levels of interest in sports as entertainment.

Sports businesses produce entertainment products by cooperating with one another. While it is illegal for businesses in most industries to collude in setting output and prices, sports leagues engage in cooperative output and pricing as a standard part of their business model. The number of games, indeed the entire schedule of games in a sport, is determined by the league. In fact, aspects of professional sports are granted monopoly power by the federal government in the United States.

When developing a model for a typical business or firm, an economist would assume profit maximization as a motive. But for a professional sports team, an owner’s motives may not be so easily understood. While one owner may operate his or her team for profit year by year, another may seek to maximize wins or overall utility. Another may look for capital appreciation—buying, then selling after a few years. Lacking knowledge of owners’ motives, it is difficult to predict what they will do.

Gaining market share and becoming the dominant player is a goal of firms in many industries. Not so in the business of professional sports. If one team were assured of victory in almost all of its contests, interest in those contests could wane. A team benefits by winning more often than losing, but winning all the time may be less beneficial than winning most of the time. Professional sports leagues claim to be seeking competitive balance, although there are dominant teams in many leagues.

Table 1.1. Sports and Recreation Activities in the United States

Aerobic Exercising              Ice/Figure Skating
Archery (Target)                In-Line Roller Skating
Backpack/Wilderness Camping     Kayaking
Baseball                        Lacrosse
Basketball                      Martial Arts/MMA/Tae Kwon Do
Bicycle Riding                  Mountain Biking (Off Road)
Billiards/Pool                  Muzzleloading
Boating (Motor/Power)           Paintball Games
Bowling                         Running/Jogging
Boxing                          Scuba Diving (Open Water)
Camping (Vacation/Overnight)    Skateboarding
Canoeing                        Skiing (Alpine)
Cheerleading                    Skiing (Cross Country)
Dart Throwing                   Snowboarding
Exercise Walking                Soccer
Exercising with Equipment       Softball
Fishing (Fresh Water)           Swimming
Fishing (Salt Water)            Table Tennis/Ping Pong
Football (Flag)                 Target Shooting (Airgun)
Football (Tackle)               Target Shooting (Live Ammunition)
Football (Touch)                Tennis
Golf                            Volleyball
Gymnastics                      Water Skiing
Hiking                          Weight Lifting
Hockey (Ice)                    Work Out at Club/Gym/Fitness Studio
Hunting with Bow & Arrow        Wrestling
Hunting with Firearms           Yoga

Sports is big business, as shown by valuations and finances of the major professional sports in the United States and worldwide. Data from Forbes for Major League Baseball (MLB), the National Basketball Association (NBA), the National Football League (NFL), and worldwide soccer teams are shown in tables 1.2 through 1.5.

Professional sports teams most certainly compete with one another in the labor market, and labor in the form of star players is in short supply. Some argue that salary caps are necessary to preserve competitive balance. Salary caps also help teams in limiting expenditures on players.

Most professional sports in the United States have salary caps. The 2015 salary cap for NFL teams, with fifty-three-player rosters, is set at $143.28 million (Patra 2015). Most teams have payrolls at or near the cap, making the average salary of an NFL player about $2.7 million. One player on an NFL team may be designated as a franchise player, restricting that player from entering free agency. The league sets minimum salaries for franchise players. For example, a franchise quarterback has a minimum salary of $18.544 million in 2015. The highest annual salary among NFL players is $22 million for Aaron Rodgers, Green Bay Packers quarterback (spotrac 2015c). The minimum annual salary is $420 thousand.
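The average quoted above follows from dividing the cap by the roster size. A minimal Python sketch of that arithmetic, assuming every team spends exactly at the cap; the function and constant names are illustrative and do not come from the book's code:

```python
# Figures quoted in the text for the 2015 NFL season.
SALARY_CAP_MILLIONS = 143.28
ROSTER_SIZE = 53


def average_salary_at_cap(cap_millions: float, roster_size: int) -> float:
    """Average salary ($ millions) when team payroll equals the cap."""
    return cap_millions / roster_size


avg = average_salary_at_cap(SALARY_CAP_MILLIONS, ROSTER_SIZE)
print(f"Average salary at the cap: ${avg:.2f} million")
```

Because payrolls are only "at or near" the cap, this is an approximation of the league average rather than an exact figure.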

NBA teams have a $70 million salary cap for the 2015–16 season, with penalties for teams going over the cap. Maximum player salaries are based on a percentage of the cap and years of service. For example, LeBron James, with ten years of experience, would have a maximum salary of $23 million (Mahoney 2015). Anthony Davis of the New Orleans Pelicans has the highest average salary among NBA players, $29 million (spotrac 2015b). Team rosters include fifteen players under contract, with as many as thirteen available to play in any particular game. The minimum annual salary is $428,498.

Major League Baseball (MLB) has a “luxury tax” for teams with payrolls in excess of $189 million. The regular roster has twenty-five players, or twenty-six on double-header days/nights. A forty-man roster includes players under contract and eligible to play. Between September 1 and the end of the regular season, the roster is expanded to forty players. The roster drops back to twenty-five players for the playoffs. The minimum MLB annual salary is $505,700 in 2015. The highest MLB annual salary is $31 million for Miguel Cabrera of the Detroit Tigers (spotrac 2015a).

Table 1.2. MLB Team Valuation and Finances (March 2015)

Value is the current team value. Value, revenue, and operating income are in $ millions; one-year change in value and debt/value are percentages.

                                      One-Year                       Operating
Rank  Team                     Value    Change  Debt/Value  Revenue     Income
   1  New York Yankees         3,200        28           0      508        8.1
   2  Los Angeles Dodgers      2,400        20          17      403      -12.2
   3  Boston Red Sox           2,100        40           0      370       49.2
   4  San Francisco Giants     2,000       100           4      387       68.4
   5  Chicago Cubs             1,800        50          24      302       73.3
   6  St Louis Cardinals       1,400        71          21      294       73.6
   7  New York Mets            1,350        69          26      263       25.0
   8  Los Angeles Angels       1,300        68           0      304       16.7
   9  Washington Nationals     1,280        83          27      287       41.4
  10  Philadelphia Phillies    1,250        28           8      265      -39.0
  11  Texas Rangers            1,220        48          13      266        3.5
  12  Atlanta Braves           1,150        58           0      267       33.2
  13  Detroit Tigers           1,125        65          15      254      -20.7
  14  Seattle Mariners         1,100        55           0      250       26.4
  15  Baltimore Orioles        1,000        61          15      245       31.4
  16  Chicago White Sox          975        40           5      227       31.9
  17  Pittsburgh Pirates         900        57          10      229       43.6
  18  Minnesota Twins            895        48          25      223       21.3
  19  San Diego Padres           890        45          22      224       35.0
  20  Cincinnati Reds            885        48           6      227        2.2
  21  Milwaukee Brewers          875        55           6      226       11.3
  22  Toronto Blue Jays          870        43           0      227      -17.9
  23  Colorado Rockies           855        49           7      214       12.6
  24  Arizona Diamondbacks       840        44          17      211       -2.2
  25  Cleveland Indians          825        45           9      207        8.9
  26  Houston Astros             800        51          34      175       21.6
  27  Oakland Athletics          725        46           8      202       20.8
  28  Kansas City Royals         700        43           8      231       26.6
  29  Miami Marlins              650        30          34      188       15.4
  30  Tampa Bay Rays             625        29          22      188        7.9

Source. Badenhausen, Ozanian, and Settimi (2015b).

Table 1.3. NBA Team Valuation and Finances (January 2015)

Value is the current team value. Value, revenue, and operating income are in $ millions; one-year change in value and debt/value are percentages.

                                      One-Year                       Operating
Rank  Team                     Value    Change  Debt/Value  Revenue     Income
   1  Los Angeles Lakers       2,600        93           2      293      104.1
   2  New York Knicks          2,500        79           0      278       53.4
   3  Chicago Bulls            2,000       100           3      201       65.3
   4  Boston Celtics           1,700        94           9      173       54.9
   5  Los Angeles Clippers     1,600       178           0      146       20.1
   6  Brooklyn Nets            1,500        92          19      212      -99.4
   7  Golden State Warriors    1,300        73          12      168       44.9
   8  Houston Rockets          1,250        61           8      175       38.0
   9  Miami Heat               1,175        53           8      188       12.6
  10  Dallas Mavericks         1,150        50          17      168       30.4
  11  San Antonio Spurs        1,000        52           8      172       40.9
  12  Portland Trail Blazers     940        60          11      153       11.7
  13  Oklahoma City Thunder      930        58          15      152       30.8
  14  Toronto Raptors            920        77          16      151       17.9
  15  Cleveland Cavaliers        915        78          22      149       20.6
  16  Phoenix Suns               910        61          20      145       28.2
  17  Washington Wizards         900        86          14      143       10.1
  18  Orlando Magic              875        56          17      143       20.9
  19  Denver Nuggets             855        73           1      136       14.0
  20  Utah Jazz                  850        62           6      142       32.7
  21  Indiana Pacers             830        75          18      139       25.0
  22  Atlanta Hawks              825        94          21      133       14.8
  23  Detroit Pistons            810        80          23      144       17.6
  24  Sacramento Kings           800        45          29      125        8.9
  25  Memphis Grizzlies          750        66          23      135       10.5
  26  Charlotte Hornets          725        77          21      130        1.2
  27  Philadelphia 76ers         700        49          21      125       24.4
  28  New Orleans Pelicans       650        55          19      131       19.0
  29  Minnesota Timberwolves     625        45          16      128        6.9
  30  Milwaukee Bucks            600        48          29      110       11.5

Source. Badenhausen, Ozanian, and Settimi (2015a).

Table 1.4. NFL Team Valuation and Finances (August 2014)

Value is the current team value. Value, revenue, and operating income are in $ millions; one-year change in value and debt/value are percentages.

                                      One-Year                       Operating
Rank  Team                     Value    Change  Debt/Value  Revenue     Income
   1  Dallas Cowboys           3,200        39           6      560      245.7
   2  New England Patriots     2,600        44           9      428      147.2
   3  Washington Redskins      2,400        41          10      395      143.4
   4  New York Giants          2,100        35          25      353       87.3
   5  Houston Texans           1,850        28          11      339      102.8
   6  New York Jets            1,800        30          33      333       79.5
   7  Philadelphia Eagles      1,750        33          11      330       73.2
   8  Chicago Bears            1,700        36           6      309       57.1
   9  San Francisco 49ers      1,600        31          53      270       24.8
  10  Baltimore Ravens         1,500        22          18      304       56.7
  11  Denver Broncos           1,450        25           8      301       30.7
  12  Indianapolis Colts       1,400        17           4      285       60.7
  13  Green Bay Packers        1,375        16           1      299       25.6
  14  Pittsburgh Steelers      1,350        21          15      287       52.4
  15  Seattle Seahawks         1,330        23           9      288       27.3
  16  Miami Dolphins           1,300        21          29      281        8.0
  17  Carolina Panthers        1,250        18           5      283       55.6
  18  Tampa Bay Buccaneers     1,225        15          15      275       46.4
  19  Tennessee Titans         1,160        10          11      278       35.6
  20  Minnesota Vikings        1,150        14          43      250        5.3
  21  Atlanta Falcons          1,125        21          27      264       13.1
  22  Cleveland Browns         1,120        11          18      276       35.0
  23  New Orleans Saints       1,110        11           7      278       50.1
  24  Kansas City Chiefs       1,100         9           6      260       10.0
  25  Arizona Cardinals        1,000         4          15      266       42.8
  26  San Diego Chargers         995         5          10      262       39.9
  27  Cincinnati Bengals         990         7          10      258       11.9
  28  Oakland Raiders            970        18          21      244       42.8
  29  Jacksonville Jaguars       965        15          21      263       56.9
  30  Detroit Lions              960         7          29      254      -15.9
  31  Buffalo Bills              935         7          13      252       38.0
  32  St Louis Rams              930         6          12      250       16.2

Source. Badenhausen, Ozanian, and Settimi (2014).

Table 1.5. World Soccer Team Valuation and Finances (May 2015)

Value is the current team value. Value, revenue, and operating income are in $ millions; one-year change in value and debt/value are percentages.

                                      One-Year                       Operating
Rank  Team                     Value    Change  Debt/Value  Revenue     Income
   1  Real Madrid              3,263        -5           4      746        170
   2  Barcelona                3,163        -1           3      657        174
   3  Manchester United        3,104        10          20      703        211
   4  Bayern Munich            2,347        27           0      661         78
   5  Manchester City          1,375        59           0      562        122
   6  Chelsea                  1,370        58           0      526         83
   7  Arsenal                  1,307        -2          30      487        101
   8  Liverpool                  982        42          10      415         86
   9  Juventus                   837        -2           9      379         50
  10  AC Milan                   775       -10          44      339         54
  11  Borussia Dortmund          700        17           6      355         55
  12  Paris Saint-Germain        634        53           0      643         -1
  13  Tottenham Hotspur          600        17           9      293         63
  14  Schalke 04                 572        -1           0      290         57
  15  Inter Milan                439        -9          56      222        -41
  16  Atletico de Madrid         436        33          53      231         47
  17  Napoli                     353        19           0      224         43
  18  Newcastle United           349        33           0      210         44
  19  West Ham United            309        33          12      186         54
  20  Galatasaray                294       -15          17      220        -37

Source. Ozanian (2015).

Figure 1.1, a histogram lattice, shows how player salaries compare across the MLB, NBA, and NFL in August 2015. Player salary distributions are positively skewed. The mean salary across NFL players is around $1.7 million, but the median is $630 thousand. The mean salary across NBA players is around $5.1 million, with median salary $2.8 million. The mean salary across MLB players is around $4.1 million, with the median $1.1 million.
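Positive skew is why the mean sits well above the median in each league: a few very large salaries pull the mean upward while leaving the median almost unchanged. The sketch below illustrates the effect in Python with a small set of hypothetical salaries. These are invented values, not the spotrac data behind figure 1.1, and the helper function is illustrative rather than taken from the book's code.

```python
from statistics import mean, median, pstdev

# Hypothetical salaries in $ millions (illustrative only, not spotrac data).
salaries = [0.5, 0.5, 0.6, 0.7, 0.9, 1.2, 2.0, 4.5, 12.0, 22.0]


def moment_skewness(values):
    """Population skewness: third central moment over cubed std deviation."""
    center = mean(values)
    spread = pstdev(values)
    third_moment = sum((v - center) ** 3 for v in values) / len(values)
    return third_moment / spread ** 3


print(f"mean     = {mean(salaries):.2f}")
print(f"median   = {median(salaries):.2f}")
print(f"skewness = {moment_skewness(salaries):.2f}")
```

A positive skewness value together with a mean above the median is the pattern the text describes for all three leagues.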

Do team expenditures on players buy success? This is a meaningful question to ask for leagues that have no salary caps. Szymanski (2015) reports studies showing that between 60 and 90 percent of the variability in U.K. soccer team positions may be explained by wages paid to players. Major League Baseball has a luxury tax in place of a salary cap, and team payrolls vary widely in size. The New York Yankees have been known for having the highest payrolls in baseball. Recently, the Los Angeles Dodgers have surpassed the Yankees with the highest player payroll, more than $257 million at the end of the 2014 season (Woody 2014).

Figure 1.2 shows baseball team salaries at the beginning of the 2014 season plotted against the percentage of games won across the regular season. Notice how teams that made the playoffs in 2014, labeled with team abbreviations, have a wide range of payrolls. While the biggest spenders in baseball are often among the set of teams going to the playoffs, the relationship between team payrolls and team performance is weak at best: less than 7 percent of the variability in win/loss percentages is explained by player payrolls.
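The "percent of variability explained" is presumably the coefficient of determination, R-squared, from a simple linear regression of winning percentage on payroll. With a single predictor, R-squared equals the squared Pearson correlation. The following Python sketch shows the calculation under that assumption; the payroll and winning-percentage values are hypothetical, not the 2014 data plotted in figure 1.2, and the function name is illustrative.

```python
def r_squared(x, y):
    """R-squared for a simple linear regression of y on x.

    With a single predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical team payrolls ($ millions) and percentages of games won.
payroll = [50, 90, 110, 150, 235]
win_pct = [48.0, 55.0, 43.0, 54.0, 58.0]

print(f"R-squared = {r_squared(payroll, win_pct):.3f}")
```

A value near zero means payroll tells us little about winning percentage; a value near one means payroll nearly determines it.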

The thesis of Michael Lewis’ Moneyball (2003) and what has become the ethos of sports analytics is that small-market baseball teams can win by spending their money wisely. Star players demand top salaries due as much to their celebrity status as to their skills. Players with high on-base percentages, overlooked by major-market teams, can be hired at much lower salaries than star players.

Teams, although associated with particular cities, can be known nationwide or worldwide. The media of television and the Internet provide opportunities for reaching consumers across the globe. A Super Bowl at the Rose Bowl in Pasadena, California or AT&T Stadium in Arlington, Texas may be attended by around 100 thousand fans (Alder 2015), while U.S. television audiences have grown to over 100 million (statista 2015).

Figure 1.1. MLB, NBA, and NFL Average Annual Salaries

[Figure: histogram lattice with one panel each for Major League Baseball, the National Basketball Association, and the National Football League. Horizontal axis: Annual Salary ($ millions), 0 to 30. Vertical axis: Density, 0.00 to 0.30.]

Sources. spotrac (2015a, 2015b, 2015c).

Figure 1.2. MLB Team Payrolls and Win/Loss Performance (2014 Season)

[Figure: scatter plot of the thirty MLB teams, with points labeled by team abbreviation. Horizontal axis: Team Payroll (Millions of Dollars), 50 to 250. Vertical axis: Percentage of Games Won.]

Sources. Sports Reference LLC (2015b) and USA Today (2015).

See Appendix B, page 255, for team abbreviations and names.

Media revenues are important to successful sports teams. Other revenues come from business partnerships, sponsorships, advertising, and stadium naming rights. City governments understand well the power of sports to promote business. Locating sports arenas in cities can help to revitalize downtown areas, as demonstrated by the experience of the Oklahoma City Thunder. Indianapolis, Indiana promotes itself as a sports capital with the Colts and Pacers (Rein, Shields, and Grossman 2015).

Teams seek to build their brands, developing a positive reputation in the minds of consumers. Players, like fans, are attracted to teams with a reputation for hard work, courage, fair play, honesty, teamwork, and community service. The character of a team is often as important as its likelihood of winning. The Cubs are associated with Chicago, but Cub fans may be found from Maine to California. This is despite the fact that the Cubs have not won the World Series since 1908. Teams in U.S. professional sports vie to become “America’s team,” with fans across the land wearing their logo-embossed hats and jerseys.

The demand for sports and the feelings of sports consumers are not so easily understood. Fans can be fickle and fandom fleeting. Fans can be loyal to a sport, to a team, or to individual players. Multivariate methods can help us understand how sports consumers think by revealing relationships among products or brands.

Figure 1.3 provides an example, a perceptual map of seven sports. Along the horizontal dimension, we move from individual, non-contact sports on the left-hand side, to team sports with little contact, to team sports with contact on the right-hand side. The vertical dimension, less easily described, may be thought of as relating to the aerobic versus anaerobic nature of sports and to other characteristics such as physicality and skill. Sports such as tennis, soccer, and basketball entail aerobic exercise. These are endurance sports, while football is an example of a sport that involves both aerobic and anaerobic exercise, including intense exercise for short durations. Sports close together on the map have similarities. Baseball and golf, for example, involve special skills, such as precision in hitting a ball. Soccer and hockey involve almost continuous movement and getting a ball through the goal. Football and hockey have high physicality or player contact.
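This section does not say which multivariate method produced the map; the book's own code for it is Exhibit 1.3, written in R. One technique commonly used for perceptual maps is classical (metric) multidimensional scaling, which places objects in a plane so that distances on the map approximate their pairwise dissimilarities. The Python sketch below assumes that technique. The four-sport dissimilarity matrix is hypothetical, all function names are illustrative, and power iteration stands in for the library eigen-solver one would normally use.

```python
from math import sqrt


def double_center(dissimilarities):
    """Return B = -1/2 * J * D^2 * J for a symmetric dissimilarity matrix D."""
    n = len(dissimilarities)
    squared = [[d * d for d in row] for row in dissimilarities]
    row_means = [sum(row) / n for row in squared]
    grand_mean = sum(row_means) / n
    return [
        [-0.5 * (squared[i][j] - row_means[i] - row_means[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


def top_eigenpair(matrix, iterations=500):
    """Largest-magnitude eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    # Irregular start vector; it must not be orthogonal to the eigenvector sought.
    vector = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n))
                   for i in range(n)]
        norm = sqrt(sum(p * p for p in product))
        if norm == 0.0:
            return 0.0, [0.0] * n
        vector = [p / norm for p in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vector, product))
    return eigenvalue, vector


def classical_mds(dissimilarities, dimensions=2):
    """Map coordinates, one row per object, from classical metric MDS."""
    b = double_center(dissimilarities)
    n = len(b)
    coords = [[0.0] * dimensions for _ in range(n)]
    for d in range(dimensions):
        eigenvalue, vector = top_eigenpair(b)
        # Negative eigenvalues (non-Euclidean input) are clamped to zero.
        scale = sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][d] = vector[i] * scale
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * vector[i] * vector[j]
    return coords


# Hypothetical dissimilarities among four sports (illustrative values only).
labels = ["Golf", "Tennis", "Soccer", "Football"]
dissimilarity = [
    [0.0, 2.0, 5.0, 6.0],
    [2.0, 0.0, 4.0, 5.0],
    [5.0, 4.0, 0.0, 2.0],
    [6.0, 5.0, 2.0, 0.0],
]

coords = classical_mds(dissimilarity)
for label, (first, second) in zip(labels, coords):
    print(f"{label:<10} {first:8.3f} {second:8.3f}")
```

The signs and orientation of the resulting axes are arbitrary, so interpreting the dimensions, as the text does for figure 1.3, remains a matter of judgment.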


Figure 1.3. A Perceptual Map of Seven Sports

[Perceptual map. Horizontal axis: First Dimension (Individual/Team, Degree of Contact). Vertical axis: Second Dimension (Anaerobic/Aerobic, Other). Points plotted: Baseball, Basketball, Football, Soccer, Tennis, Hockey, Golf.]

In many respects, professional sports teams are decidedly different from other businesses. They are in the public eye. They live and die in the media. And a substantial portion of their revenues comes from media.

Késenne (2007), Szymanski (2009), Fort (2011), Fort and Winfree (2013), Leeds and von Allmen (2014), and the edited volumes of Humphreys and Howard (2008a, 2008b, 2008c) review sports economics and business issues.

Gorman and Calhoun (1994) and Rein, Shields, and Grossman (2015) focus on alternative sources of revenue for sports teams and how these relate to business strategy. The business of baseball has been the subject of numerous volumes (Miller 1990; Zimbalist 1992; Powers 2003; Bradbury 2007; Pessah 2015). And Jozsa (2010) reviews the history of the National Basketball Association.


An overview of sports marketing is provided by Mullin, Hardy, and Sutton (2014). Rein, Kotler, and Shields (2006) and Carter (2011) discuss the convergence of entertainment and sports. Miller (2015a) reviews methods in marketing data science, including product positioning maps, market segmentation, target marketing, customer relationship management, and competitive analysis.

Sports also represents a laboratory for labor market research. Sports is one of the few industries in which job performance and compensation are public knowledge. Economic studies examine player performance measures and value of individual players to teams (Kahn 2000; Bradbury 2007). Miller (1991), Abrams (2010), and Lowenfish (2010) review baseball labor relations. And Early (2011) provides insight into labor and racial discrimination in professional sports.

Sports wagering markets have been studied extensively by economists because they provide public information about price, volume, and rates of return. Furthermore, sports betting opportunities have fixed beginning and ending times and published odds or point spreads, making them easier to study than many financial investment opportunities. As a result, sports wagering markets have become a virtual field laboratory for the study of market efficiency. Sauer (1998) provides a comprehensive review of the economics of wagering markets.

When management objectives can be defined clearly in mathematical terms, teams use mathematical programming methods (constrained optimization). Teams attempt to maximize revenue or minimize costs subject to known situational factors. There has been extensive work on league schedules, for which the league objective may be to have teams playing one another an equal number of times while minimizing total distance traveled between cities. Alternatively, league officials may seek home/away schedules, revenue sharing formulas, or draft lottery rules that maximize competitive balance. Briskorn (2008) reviews methods for scheduling sports competition, drawing on integer programming, combinatorics, and graph theory. Wright (2009) provides an overview of operations research in sport.
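The equal-play requirement mentioned above can be illustrated with a small sketch. The R code below builds a single round-robin schedule using the circle method, a standard combinatorial construction. It does not attempt the travel-minimizing optimization itself, which would typically be posed as an integer program. The function name `round_robin` and the team labels are hypothetical, and the home/away labels are arbitrary.

```r
# Circle method for a single round-robin schedule (illustrative sketch).
# Assumes an even number of teams, at least four; team labels are hypothetical.
round_robin <- function(teams) {
  n <- length(teams)
  stopifnot(n >= 4, n %% 2 == 0)
  slots <- teams
  rounds <- vector("list", n - 1)
  for (r in seq_len(n - 1)) {
    # pair position i with position n + 1 - i
    first_half <- slots[1:(n / 2)]
    second_half <- rev(slots[(n / 2 + 1):n])
    rounds[[r]] <- data.frame(round = r, home = first_half,
      away = second_half, stringsAsFactors = FALSE)
    # keep the first team fixed and rotate the others one position
    slots <- c(slots[1], slots[n], slots[2:(n - 1)])
  }
  do.call(rbind, rounds)
}

schedule <- round_robin(c("A", "B", "C", "D", "E", "F"))
print(schedule)

# check that each pair of teams meets exactly once
pair_key <- apply(schedule[, c("home", "away")], 1,
  function(p) paste(sort(p), collapse = "-"))
print(table(pair_key))
```

A realistic league schedule would add travel distances, venue availability, and home/away balance as an objective and constraints on top of this pairing structure.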


Extensive data about sports are in the public domain, readily available in newspapers and online sources. These data offer opportunities for predictive modeling and research. Throughout the book we also identify places to apply methods of operations research, including mathematical programming and simulation.

Exhibit 1.1 shows an R program for exploring distributions of player salaries across the MLB, NBA, and NFL. The program draws on software for statistical graphics from Sarkar (2008).

Exhibit 1.2 (page 18) shows an R program for examining the relationship between MLB payrolls and win-loss performance. The program draws on software for statistical graphics from Wickham and Chang (2014).

Exhibit 1.3 (page 19) shows an R program to obtain a perceptual map of seven sports, showing their relationships with one another. The program draws on modeling software for multidimensional scaling.


Exhibit 1.1. MLB, NBA, and NFL Player Salaries (R)

# MLB, NBA, and NFL Player Salaries (R)

library(lattice) # statistical graphics

# variables in contract data from spotrac.com (August 2015)

# player: player name (contract years)

# position: position on team

# team: team abbreviation

# teamsignedwith: team that signed the original contract

# age: age in years as of August 2015

# years: years as player in league

# contract: dollars in contract

# guaranteed: guaranteed dollars in contract

# guaranteedpct: percentage of contract dollars guaranteed

# salary: annual salary in dollars

# yearfreeagent: year player becomes free agent

# additional created variables

# salarymm: salary in millions

# leaguename: full league name

# league: league abbreviation

# read data for Major League Baseball

mlb_contract_data <- read.csv("mlb_player_salaries_2015.csv")

mlb_contract_data$leaguename <- rep("Major League Baseball",

length = nrow(mlb_contract_data))

# recode zero values as missing (NA) without an explicit loop

mlb_contract_data$yearfreeagent[which(mlb_contract_data$yearfreeagent == 0)] <- NA

mlb_contract_data$age[which(mlb_contract_data$age == 0)] <- NA

mlb_contract_data$salarymm <- mlb_contract_data$salary/1000000

mlb_contract_data$league <- rep("MLB", length = nrow(mlb_contract_data))

print(summary(mlb_contract_data))

# variables for plotting

mlb_data_plot <- mlb_contract_data[, c("salarymm","leaguename")]

# read data for the National Basketball Association

nba_contract_data <- read.csv("nba_player_salaries_2015.csv")

nba_contract_data$leaguename <- rep("National Basketball Association",

length = nrow(nba_contract_data))

# recode zero values as missing (NA) without an explicit loop

nba_contract_data$yearfreeagent[which(nba_contract_data$yearfreeagent == 0)] <- NA

nba_contract_data$age[which(nba_contract_data$age == 0)] <- NA

nba_contract_data$salarymm <- nba_contract_data$salary/1000000

nba_contract_data$league <- rep("NBA", length = nrow(nba_contract_data))

print(summary(nba_contract_data))


# variables for plotting

nba_data_plot <- nba_contract_data[, c("salarymm","leaguename")]

# read data for the National Football League

nfl_contract_data <- read.csv("nfl_player_salaries_2015.csv")

nfl_contract_data$leaguename <- rep("National Football League",

length = nrow(nfl_contract_data))

# recode zero values as missing (NA) without an explicit loop

nfl_contract_data$yearfreeagent[which(nfl_contract_data$yearfreeagent == 0)] <- NA

nfl_contract_data$age[which(nfl_contract_data$age == 0)] <- NA

nfl_contract_data$salarymm <- nfl_contract_data$salary/1000000

nfl_contract_data$league <- rep("NFL", length = nrow(nfl_contract_data))

print(summary(nfl_contract_data))

# variables for plotting

nfl_data_plot <- nfl_contract_data[, c("salarymm","leaguename")]

# merge contract data with variables for plotting

plotting_data_frame <- rbind(mlb_data_plot, nba_data_plot, nfl_data_plot)

# generate the histogram lattice for comparing player salaries

# across the three leagues in this study

lattice_object <- histogram(~salarymm | leaguename, plotting_data_frame,

type = "density", xlab = "Annual Salary ($ millions)", layout = c(1,3))

# print to file

pdf(file = "fig_understanding_markets_player_salaries.pdf",

width = 8.5, height = 11)

print(lattice_object)

dev.off()


Exhibit 1.2. Payroll and Performance in Major League Baseball (R)

# Payroll and Performance in Major League Baseball (R)

library(ggplot2) # statistical graphics

library(grid) # viewport(), grid.layout(), and related functions used below

# functions used with grid graphics to split the plotting region

# to set margins and to plot more than one ggplot object on one page/screen

vplayout <- function(x, y)

viewport(layout.pos.row=x, layout.pos.col=y)

# user-defined function to plot a ggplot object with margins

ggplot.print.with.margins <- function(ggplot.object.name,

left.margin.pct=10,

right.margin.pct=10,top.margin.pct=10,bottom.margin.pct=10)

{ # begin function for printing ggplot objects with margins

# margins expressed as percentages of total... use integers

grid.newpage()

pushViewport(viewport(layout=grid.layout(100,100)))

print(ggplot.object.name,

vp=vplayout((0 + top.margin.pct):(100 - bottom.margin.pct),

(0 + left.margin.pct):(100 - right.margin.pct)))

} # end function for printing ggplot objects with margins

# read in payroll and performance data

# including annotation text for team abbreviations

mlb_data <- read.csv("mlb_payroll_performance_2014.csv")

mlb_data$millions <- mlb_data$payroll/1000000

mlb_data$winpercent <- mlb_data$wlpct * 100

cat("\nCorrelation between Payroll and Performance:\n")

with(mlb_data, print(cor(millions, winpercent)))

cat("\nProportion of win/loss percentage explained by payrolls:\n")

with(mlb_data, print(cor(millions, winpercent)^2))

pdf(file = "fig_understanding_markets_payroll_performance.pdf",

width = 5.5, height = 5.5)

ggplot_object <- ggplot(data = mlb_data,

aes(x = millions, y = winpercent)) +

geom_point(colour = "darkblue", size = 3) +

xlab("Team Payroll (Millions of Dollars)") +

ylab("Percentage of Games Won") +

geom_text(aes(label = textleft), size = 3, hjust = 1.3) +

geom_text(aes(label = textright), size = 3, hjust = -0.25)

ggplot.print.with.margins(ggplot_object, left.margin.pct = 5,

right.margin.pct = 5, top.margin.pct = 5, bottom.margin.pct = 5)

dev.off()


Exhibit 1.3. Making a Perceptual Map of Sports (R)

# Making a Perceptual Map of Sports (R)

library(MASS) # includes functions for multidimensional scaling

library(wordcloud) # textplot utility to avoid overlapping text

USE_METRIC_MDS <- FALSE # metric versus non-metric toggle

# utility function for converting a distance structure

# to a distance matrix as required for some routines and

# for printing of the complete matrix for visual inspection.

make.distance.matrix <- function(distance_structure)

{ n <- attr(distance_structure, "Size")

full <- matrix(0,n,n)

full[lower.tri(full)] <- distance_structure

full+t(full)

}

# enter data into a distance structure as required for various

# distance-based routines. That is, we enter the upper triangle

# of the distance matrix as a single vector of distances

distance_structure <-

as.single(c(9,11,10,5,14,4,15,6,12,13,16,1,18,2,20,7,3,19,17,8,21))

# provide a character vector of sports names

sport_names <- c("Baseball", "Basketball", "Football",

"Soccer", "Tennis", "Hockey", "Golf")

attr(distance_structure, "Size") <- length(sport_names) # set size attribute

# check to see that the distance structure has been entered correctly

# by converting the distance structure to a distance matrix

# using the utility function make.distance.matrix, which we had defined

distance_matrix <- unlist(make.distance.matrix(distance_structure))

cat("\n","Distance Matrix of Seven Sports","\n")

print(distance_matrix)

if (USE_METRIC_MDS)

{

# apply the metric multidimensional scaling algorithm and plot the map

mds_solution <- cmdscale(distance_structure, k = 2, eig = TRUE)

}

# apply the non-metric multidimensional scaling algorithm

# this is more appropriate for rank-order data

# and provides a more satisfactory solution here

if (!USE_METRIC_MDS)

{

mds_solution <- isoMDS(distance_matrix, k = 2, trace = FALSE)

}


pdf(file = "plot_nonmetric_mds_seven_sports.pdf",

width=8.5, height=8.5) # opens pdf plotting device

# use par(mar = c(bottom, left, top, right)) to set up margins on the plot

par(mar=c(7.5, 7.5, 7.5, 5))

# original solution

First_Dimension <- mds_solution$points[,1]

Second_Dimension <- mds_solution$points[,2]

# set up the plot but do not plot points... use names for points

plot(First_Dimension, Second_Dimension, type = "n", cex = 1.5,

xlim = c(-15, 15), ylim = c(-15, 15)) # first page of pdf plots

# We plot the sport names in the locations where points normally go.

text(First_Dimension, Second_Dimension, labels = sport_names,

offset = 0.0, cex = 1.5)

title("Seven Sports (initial solution)")

# reflect the horizontal dimension

# multiply the first dimension by -1 to get reflected image

First_Dimension <- mds_solution$points[,1] * -1

Second_Dimension <- mds_solution$points[,2]

plot(First_Dimension, Second_Dimension, type = "n", cex = 1.5,

xlim = c(-15, 15), ylim = c(-15, 15)) # second page of pdf plots

text(First_Dimension, Second_Dimension, labels = sport_names,

offset = 0.0, cex = 1.5)

title("Seven Sports (horizontal reflection)")

# reflect the vertical dimension

# multiply the second dimension by -1 to get reflected image

First_Dimension <- mds_solution$points[,1]

Second_Dimension <- mds_solution$points[,2] * -1

plot(First_Dimension, Second_Dimension, type = "n", cex = 1.5,

xlim = c(-15, 15), ylim = c(-15, 15)) # third page of pdf plots

text(First_Dimension, Second_Dimension, labels = sport_names,

offset = 0.0, cex = 1.5)

title("Seven Sports (vertical reflection)")

# multiply the first and second dimensions by -1

# for reflection in both horizontal and vertical directions

First_Dimension <- mds_solution$points[,1] * -1

Second_Dimension <- mds_solution$points[,2] * -1

plot(First_Dimension, Second_Dimension, type = "n", cex = 1.5,

xlim = c(-15, 15), ylim = c(-15, 15)) # fourth page of pdf plots

text(First_Dimension, Second_Dimension, labels = sport_names,

offset = 0.0, cex = 1.5)

title("Seven Sports (horizontal and vertical reflection)")

dev.off() # closes the pdf plotting device

pdf(file = "plot_pretty_original_mds_seven_sports.pdf",

width=8.5, height=8.5) # opens pdf plotting device

# use par(mar = c(bottom, left, top, right)) to set up margins on the plot

par(mar=c(7.5, 7.5, 7.5, 5))


First_Dimension <- mds_solution$points[,1] # no reflection

Second_Dimension <- mds_solution$points[,2] # no reflection

# wordcloud utility for plotting with no overlapping text

textplot(x = First_Dimension,

y = Second_Dimension,

words = sport_names,

show.lines = FALSE,

xlim = c(-15, 15), # extent of horizontal axis range

ylim = c(-15, 15), # extent of vertical axis range

xaxt = "n", # suppress tick marks

yaxt = "n", # suppress tick marks

cex = 1.15, # size of text points

mgp = c(0.85, 1, 0.85), # position of axis labels

cex.lab = 1.5, # magnification of axis label text

xlab = "",

ylab = "")

dev.off() # closes the pdf plotting device

pdf(file = "fig_sports_perceptual_map.pdf",

width=8.5, height=8.5) # opens pdf plotting device

# use par(mar = c(bottom, left, top, right)) to set up margins on the plot

par(mar=c(7.5, 7.5, 7.5, 5))

First_Dimension <- mds_solution$points[,1] * -1 # reflect horizontal

Second_Dimension <- mds_solution$points[,2]

# wordcloud utility for plotting with no overlapping text

textplot(x = First_Dimension,

y = Second_Dimension,

words = sport_names,

show.lines = FALSE,

xlim = c(-15, 15), # extent of horizontal axis range

ylim = c(-15, 15), # extent of vertical axis range

xaxt = "n", # suppress tick marks

yaxt = "n", # suppress tick marks

cex = 1.15, # size of text points

mgp = c(0.85, 1, 0.85), # position of axis labels

cex.lab = 1.5, # magnification of axis label text

xlab = "First Dimension (Individual/Team, Degree of Contact)",

ylab = "Second Dimension (Anaerobic/Aerobic, Other)")

dev.off() # closes the pdf plotting device


2 Assessing Players

Pete: “Gus, did you ever think in a million years computers would be a part of this game?”

Gus: “Computers? Anyone uses computers doesn’t know a damn thing about this game.”

Pete: “You know, if you wanted to, you could access any high school or college roster, pull the stats on any player at any time. You wouldn’t have to waste your time with all these papers.”

Gus: “I’m not wasting my time. I enjoy doing this.”

Pete: “You know, they got a special program now that can calculate a player’s stats and, based on the competition he’s seen, tell you whether or not he’s ready for the next level. You believe that?”

Gus: “Yeah, what else does it tell you? When to scratch your ass?”

Pete: “I don’t like them either, but they’re part of the business now.”

Gus: “Pete, scouts, good scouts are the heart of this game. They decide who’s gonna play and, if they’re lucky, they decide how it’s gonna be played. But a computer can’t tell if a kid’s got instincts or not, or if he can hit a cut-off man, or hit behind the runner. . . or look into a kid’s face that’s just gone oh-for-four and know if he’s gonna be able to come back like nothing’s happened. No, a computer can’t tell you all that crap, I’ll tell you. No.”

—JOHN GOODMAN AS PETE KLINE, AND CLINT EASTWOOD AS GUS LOBEL

IN Trouble with the Curve (2012)


The job of sports performance analytics is to understand how various factors contribute to success on the fields and courts of play, and this job begins with measurement. Many factors contribute to success in sport. There are measures that relate to physical stature, biophysics, health, fitness, and conditioning. There are measures of athleticism dealing with speed, power, strength, flexibility, and agility. There are psychological measures of intelligence, personality, and attitude. Finally, there are measures relating to proficiency in sport: knowledge, skill, and execution in practice and in games.

Let us step back from the things we see and hear about on a regular basis— the language of sportscasting—and ask basic measurement questions. What do we want from performance measures in sports? What makes a measure reliable? What makes a measure valid?

We use the term reliability to refer to the trustworthiness or repeatability of measurement procedures. We consider the degree to which repeated measures of the same trait at the same time agree with one another, as in test-retest reliability or split-half reliability. When a trait is assessed with a multi-item survey, we ask that the measure have internal consistency.

When we talk about validity, we are thinking of the degree to which a measure measures what it is supposed to measure. There are subjective assessments of face validity or content validity. We examine measurements to see the degree to which they appear to measure the traits they are supposed to measure.

A more objective approach to validity assessment would be to demonstrate predictive validity. Knowing how two traits are related in theory, we can create measures of those traits and examine the degree to which these measures relate as theory suggests. The meaning of a measure is defined by its relationship to other measures. This is construct validity, a logical extension of predictive validity.

Campbell and Fiske (1959) define reliability and validity as follows:

Reliability is the agreement between two efforts to measure the same trait through maximally similar methods. Validity is represented in the agreement between two attempts to measure the same trait through maximally different methods. (83)

The prototypical measurement study involves the multitrait-multimethod matrix, as shown in Figure 2.1.


Figure 2.1. Multitrait-Multimethod Matrix for Baseball Measures

[Correlation matrix. Rows and columns: Method 1 (Practice Measures) and Method 2 (Game-Day Measures), each with Trait A (Hitting Ability) and Trait B (Power Hitting Ability). Labeled regions: reliability diagonal, validity diagonal, heterotrait-monomethod triangle, heterotrait-heteromethod triangle. Correlations shown: 0.90, 0.80, 0.80, 0.30, 0.40, 0.20, 0.45, 0.50.]

Convergent validity is demonstrated by high correlations in the validity diagonal.

Discriminant validity is demonstrated by the relative sizes of correlations in diagonals and triangles.

Method variance is demonstrated by relatively high heterotrait-monomethod correlations when traits are assumed to be uncorrelated.

The meaning of a measure is defined by its relationships to other measures.


The multitrait-multimethod matrix has rows and columns associated with traits (attributes) and methods (measurement procedures). Each element of the matrix represents a trait-method unit. The components of the matrix are the reliability diagonal, validity diagonal, heterotrait-monomethod triangles, and heterotrait-heteromethod triangles.

Figure 2.1 shows a hypothetical multitrait-multimethod matrix with four baseball measures. It portrays two underlying traits: general hitting ability and power hitting ability, measuring them both in practice and games. In batting practice, we first ask the player to hit each ball pitched as cleanly as he can and determine the proportion of balls hit in fair territory. Then we ask him to hit each ball pitched as far as he can and determine the proportion of balls hit out of the park, much as we would see in a home run derby. For game-day measures, we refer to box scores to obtain the player’s batting average and home run rate (home runs per at bat). This gives four distinct measures. We compute the correlations between all pairs of measures and show the results in the matrix.

To assess reliability for training measures, we would use the same measurement procedures on numerous days and compute the average intercorrelation across all pairs of measures. And to assess reliability for game-day measures, we would compute correlations between measures from odd-numbered games and measures from even-numbered games (or alternatively, we could compute correlations between measures across many random split-halves of games and average those correlations). Reliability refers to measures of the same trait in the same way at about the same time. The reliability of sports performance measures is very high because these are objective measures based on counts. Aside from variations across official scorers, there is little subjectivity or opinion involved in these measures. We expect high correlations in the reliability diagonals of the multitrait-multimethod matrix.
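The odd/even split described above can be sketched in R with simulated data. Everything in this sketch is an assumption made for illustration: the number of players and games, four at bats per game, and the range of underlying hit probabilities. The Spearman-Brown step-up at the end is a standard psychometric adjustment that is not part of the text; it estimates reliability for the full set of games from the half-length correlation.

```r
# Split-half reliability of a game-day batting measure (simulated data).
# All quantities are hypothetical: 200 players, 162 games, 4 at bats per game.
set.seed(1234)
n_players <- 200
n_games <- 162
at_bats_per_game <- 4

# each player's underlying probability of a hit in an at bat (assumed range)
true_rate <- runif(n_players, min = 0.200, max = 0.350)

# hits[i, g] is the number of hits for player i in game g
hits <- matrix(rbinom(n_players * n_games, size = at_bats_per_game,
  prob = rep(true_rate, times = n_games)),
  nrow = n_players, ncol = n_games)

# batting averages computed separately from odd- and even-numbered games
odd_games <- seq(1, n_games, by = 2)
even_games <- seq(2, n_games, by = 2)
avg_odd <- rowSums(hits[, odd_games]) /
  (length(odd_games) * at_bats_per_game)
avg_even <- rowSums(hits[, even_games]) /
  (length(even_games) * at_bats_per_game)

# split-half correlation across players
split_half_r <- cor(avg_odd, avg_even)

# Spearman-Brown step-up to the full number of games
full_length_r <- 2 * split_half_r / (1 + split_half_r)

cat("Split-half correlation:", round(split_half_r, 3), "\n")
cat("Spearman-Brown estimate:", round(full_length_r, 3), "\n")
```

Averaging over many random split-halves, as the text suggests, would replace the fixed odd/even index vectors with repeated random partitions of the game columns.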

What else do we expect to see in a multitrait-multimethod matrix? We should see different measures of the same trait correlating positively on the validity diagonal. Hitting in batting practice should have a positive correlation with hitting in games. We expect measures of the same trait to correlate more highly with one another than with measures of different traits. Accordingly, we should see higher correlations on the validity diagonal than in either the heterotrait-monomethod or the heterotrait-heteromethod triangles.

The meaning of a measure is defined by its relationships to other measures. This notion of construct validity is illustrated by the multitrait-multimethod matrix and what Campbell and Fiske (1959) call convergent validity and discriminant validity. Convergent validity refers to the idea that different measures of the same trait should converge. That is, different measures of the same trait or attribute should have relatively high correlations. Discriminant validity refers to the notion that measures of different traits should diverge. In other words, measures of different traits should have lower correlations than measures of the same trait. Convergent and discriminant validation are part of what we mean by construct validation. The meaning of a measure is defined in terms of its relationship to other measures.
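A minimal R sketch shows how these checks might be read off a multitrait-multimethod matrix. The correlations below are made up for illustration and are not the values in Figure 2.1. The labels A1, B1, A2, and B2 are hypothetical shorthand for trait A or B measured by method 1 (practice) or method 2 (game day), and the diagonal holds assumed reliabilities.

```r
# Hypothetical multitrait-multimethod matrix (values are illustrative only).
# A = hitting ability, B = power hitting; 1 = practice, 2 = game day.
measures <- c("A1", "B1", "A2", "B2")
mtmm <- matrix(c(
  0.90, 0.40, 0.50, 0.20,
  0.40, 0.85, 0.30, 0.55,
  0.50, 0.30, 0.80, 0.45,
  0.20, 0.55, 0.45, 0.80),
  nrow = 4, byrow = TRUE, dimnames = list(measures, measures))

# same trait, same method (reliability diagonal)
reliability <- diag(mtmm)
# same trait, different method (validity diagonal)
validity <- c(mtmm["A1", "A2"], mtmm["B1", "B2"])
# different trait, same method
hetero_mono <- c(mtmm["A1", "B1"], mtmm["A2", "B2"])
# different trait, different method
hetero_hetero <- c(mtmm["A1", "B2"], mtmm["B1", "A2"])

# convergent validity: positive correlations on the validity diagonal
convergent <- all(validity > 0)

# discriminant validity: validity correlations exceed both sets of
# heterotrait correlations
discriminant <- min(validity) > max(c(hetero_mono, hetero_hetero))

cat("Convergent validity supported:", convergent, "\n")
cat("Discriminant validity supported:", discriminant, "\n")
```

With observed data, the matrix would come from `cor()` applied to the four measures, with reliabilities estimated separately and placed on the diagonal.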

What we expect in theory depends on the traits being measured. If we took running instead of general hitting ability as our first trait being studied, we would expect a very different set of results. In practice, we might also measure a player’s time in the 40-yard dash, perhaps converting the time in milliseconds to miles per hour. For the game-day measures, we refer to box scores to obtain the player’s success rate in stolen bases (stolen bases divided by stolen bases attempted). What would we expect to see for these measures in relation to hitting for power? Running ability and hitting for power would be expected to be uncorrelated or negatively correlated.

Discussions of validity touch on fundamental issues in the philosophy of science—issues of theory construction, measurement, and testability. There are no easy answers here. If the theory is correct and the measures valid, then the pattern of relationships among the measures should be similar to the pattern predicted by theory. To the extent that this is true for observed data, we have partial confirmation of the theory and, at the same time, demonstration of construct validity. But what if the predictions do not pan out? Then we are faced with a dilemma: the theory could be wrong, one or more of the measures could be invalid, or we could have observed an event of low probability with correct theory and valid measures.


Regarding measurement and philosophy of science, I like the umpire story:

After a long day of disputed calls at the ballpark, three umpires are asked to justify their methods. The first umpire, an empiricist by persuasion, says, I call them as I see them. The second, with the faith of a philosophical realist, replies, I call them as they are. Not to be outdone, the third umpire, with the self-proclaimed authority of an operationist or logical empiricist, says, The way I call them, that’s the way they are.

S. S. Stevens (1946) wrote On the Theory of Scales of Measurement, an influential article identifying four general types of measures: nominal, ordinal, interval, and ratio. It was the strength of Stevens’ convictions, perhaps more than the strength of his argument, that influenced generations of researchers. The words he chose to describe levels of measurement seemed to carry the force of law. He talked about the formal properties of scales and “permissible statistics,” arguing that “the statistical manipulations that can legitimately be applied to empirical data depend on the type of scale” (Stevens 1946, 677). Stevens argued that we could compute means, standard deviations, and correlations with interval and ratio measures, but not with ordinal measures.

Table 2.1 summarizes scale types or levels of measurement from Stevens (1946). The formal definition of a scale type follows from its mathematical properties, or what Stevens called its “mathematical group structure.” This refers to the set of data transformations that, when used on the original measures, will create new measures with the same scale properties as the original measures.

For nominal scales, any one-to-one transformation will preserve the number of categories and, hence, the scale’s essential property. For ordinal scales, any one-to-one monotonic transformation will preserve the property of order. For interval scales, any one-to-one linear transformation, a function of the form y = ax + b, will preserve the properties of the scale. Ratio scales are similar to interval scales, except that the zero point must be preserved. Accordingly, for ratio scales, a data transformation that preserves its properties must have the form y = ax.
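These transformation rules can be checked numerically. The R sketch below uses simulated, hypothetical data and assumes a positive slope a. It shows that rank order survives all three kinds of transformation, that the product-moment correlation survives linear transformations, that the rank-order correlation survives any increasing transformation, and that ratios of values survive only when the zero point is preserved.

```r
# Which properties survive which transformations (hypothetical data).
set.seed(2015)
x <- runif(50, min = 1, max = 10)
y <- x + rnorm(50, sd = 2)

linear_x <- 3 * x + 5   # y = ax + b, permissible for interval scales
ratio_x <- 3 * x        # y = ax, permissible for ratio scales
monotone_x <- exp(x)    # increasing but nonlinear, permissible for ordinal scales

# rank order is preserved by all three transformations
same_ranks <- all(rank(x) == rank(linear_x)) &&
  all(rank(x) == rank(ratio_x)) &&
  all(rank(x) == rank(monotone_x))

# product-moment correlation is unchanged by a linear transformation
pearson_original <- cor(x, y)
pearson_linear <- cor(linear_x, y)

# it generally changes under a nonlinear monotonic transformation
pearson_monotone <- cor(monotone_x, y)

# rank-order correlation is unchanged by any increasing transformation
spearman_original <- cor(x, y, method = "spearman")
spearman_monotone <- cor(monotone_x, y, method = "spearman")

# ratios of values are preserved by y = ax but not by y = ax + b
ratio_preserved_ax <- isTRUE(all.equal(ratio_x[1] / ratio_x[2], x[1] / x[2]))
ratio_preserved_axb <- isTRUE(all.equal(linear_x[1] / linear_x[2], x[1] / x[2]))

print(c(pearson_original = pearson_original,
  pearson_linear = pearson_linear,
  pearson_monotone = pearson_monotone))
print(c(spearman_original = spearman_original,
  spearman_monotone = spearman_monotone))
print(c(same_ranks = same_ranks, ratio_preserved_ax = ratio_preserved_ax,
  ratio_preserved_axb = ratio_preserved_axb))
```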


Table 2.1. Levels of Measurement

Level of Measurement (Scale Type) | Basic Empirical Operations | Mathematical Group Structure | Examples of Permissible Statistics

Nominal | equality, numbers like names | one-to-one correspondence | number of cases in class, frequency table, modal class

Ordinal | greater than, less than | one-to-one monotonic | median, percentiles, rank-order correlation

Interval | equality of intervals | one-to-one linear | mean, standard deviation, product-moment correlation

Ratio | equality of ratios | one-to-one linear, preserving the zero point | same statistics as interval level

Source: Adapted from Stevens (1946).

Researchers following Stevens’ dictums constitute the weak measurement school. They argue that many measures are ordinal rather than interval and that statistics relying on sums or differences, including means and variances, would be inappropriate for ordinal measures. Researchers following the strong statistics school, on the other hand, argue that statistical methods make no explicit assumptions about the meaning of measurements or their relationships to underlying dimensions. Strong statistics can be used with weak measurements.

For practical purposes, we ask whether or not a variable has meaningful magnitude. If a variable is categorical, it lacks meaningful magnitude. One further observation is appropriate for categorical data: we note whether the variable is binary (taking only two possible values) or multinomial (taking more than two possible values). If we can make these simple distinctions across measures, we can do much useful research. Note that most sports performance measures begin as counts, which are ratio measures. And, despite the objections of weak measurement believers, there are many situations in which computing the mean of ranks makes perfectly good sense.


Many sports analysts confuse reliability with stability. In baseball, for example, they note low correlations for player batting averages from one year to the next or pitcher earned run averages from one year to the next, saying that this is evidence of low reliability. This is incorrect thinking. Baseball measures and sports performance measures in general have very high reliability.

Reliability concerns agreement between measures of the same trait in the same way at about the same time. The reliability of sports performance measures is very high because these are objective measures based on counts.

Many sports performance measures rely on the official scoring of events on the field, box scores, and play-by-play logs. Official records, counts, and mathematical formulas for computing performance measures do not change from one observer or one analyst to the next. Many of the newer measures of player and ball location on the fields and courts of play, player running speed, and efficiency in getting to balls in play are obtained through electronic devices with little or no human intervention. These are highly reliable and trustworthy. What performance measures lack is not reliability, but stability from one year to the next or one game to the next.

Measurement is the assignment of numbers to attributes according to rules, and measurements themselves have certain desirable attributes:

Reliable. A measure should be trustworthy and repeatable.

Valid. A measure should measure the attribute it is said to measure.

Explicit. Procedures should be unambiguous and defined in detail, so that each research worker obtains the same values when using the measurement procedure.

Accessible. A measure should come from data that are easily obtained.

Tractable. A measure should be easy to work with and easy to utilize in methods and models.

Comprehensible. A measure should be simple and straightforward, so it is easily understood and interpreted.

Transparent. The method of measurement should be documented fully, so research workers can share results with one another in a spirit of open and honest scientific inquiry. There should be no trade secrets in science.


Sports performance measures vary widely in the degree to which they possess these attributes. Some measures are dependent on timing or tracking devices in stadia, and may be accessible only to leagues and teams.

Regarding performance on the field, we record play-by-play events, compute box scores, and note standings of teams. We develop general measures of offensive and defensive performance and rate players and teams. Baseball has its Sabermetrics and Moneyball. And other sports have followed suit, designing numerous measures of player performance and using them to make personnel decisions.

Psychometrics has “standardized testing,” defining explicit procedures that all test administrators must follow when making measures. Explicit, unambiguous procedures promote reliability and reproducibility.

Measures in baseball serve to illustrate measurement principles. Batting average (BA) is a simple proportion with at bats as a divisor. We look for players with batting averages above 0.250, and batting at or above 0.300 is a goal of many hitters. Batting below 0.200, sometimes referred to as “the Mendoza line,” is not a good sign for hitters. Batting average is easily understood, but criticized as a measure of hitting ability because it fails to consider the value of walks. This is a concern about the measure’s validity.

On-base percentage (OBP) is very easy to explain. Using plate appearances rather than at bats in the divisor, OBP reflects the proportion of times that a hitter reaches first base or beyond. For OBP, we look for players whose values are around 0.333, getting on base one in every three plate appearances. OBP is well known and well understood, partly as a result of its use in Moneyball (Lewis 2003). But it is criticized because it fails to consider the value of extra-base hits. This, too, is a concern about measurement validity.

Slugging percentage (SLG) is a ratio: the number of total bases per at bat. It is not especially hard to compute, although it is not a percentage as its name would imply. And because it is neither a percentage nor an average, it is difficult to interpret. A fan has to be “in the know” to understand that an SLG value of 0.300 is bad or a value of 0.500 is very good. Babe Ruth’s lifetime SLG was 0.690, which is very good. SLG suffers by being less comprehensible than other measures of hitting ability.


On-base percentage plus slugging (OPS) is a simple sum of OBP and SLG. It is neither a percentage like OBP, nor a ratio like SLG. The intent of OPS was to provide an index of ability that would reflect both getting on base and hitting with power. OPS is another measure for which a person needs to be “in the know” to understand. The average regular-season OPS across the thirty MLB teams in the 2014 season was 0.700 (Sports Reference LLC 2015a). OPS makes little intuitive sense. There is no justification for weighting OBP and SLG equally in measuring hitting ability, and adding a proportion to a ratio gives a measure with no known units. Accordingly, OPS is neither comprehensible nor valid.
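The four hitting measures discussed so far can be computed from a handful of counts. The following is a minimal sketch, not the author's code. It assumes the conventional scoring definitions: OBP uses at bats plus walks, hit-by-pitch, and sacrifice flies as its divisor, and total bases weight singles, doubles, triples, and home runs by 1, 2, 3, and 4. The `BattingLine` class and the season line are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Season counts for one hitter (hypothetical container)."""
    at_bats: int
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    hit_by_pitch: int
    sacrifice_flies: int


def batting_average(b):
    # Proportion of at bats that end in a hit.
    return b.hits / b.at_bats


def on_base_percentage(b):
    # Times on base divided by the plate appearances counted for OBP.
    reached = b.hits + b.walks + b.hit_by_pitch
    chances = b.at_bats + b.walks + b.hit_by_pitch + b.sacrifice_flies
    return reached / chances


def slugging(b):
    # Total bases per at bat: a ratio, not a percentage.
    singles = b.hits - b.doubles - b.triples - b.home_runs
    total_bases = singles + 2 * b.doubles + 3 * b.triples + 4 * b.home_runs
    return total_bases / b.at_bats


def ops(b):
    # Simple unweighted sum of a proportion and a ratio.
    return on_base_percentage(b) + slugging(b)


# Hypothetical season line, for illustration only.
hitter = BattingLine(at_bats=500, hits=150, doubles=30, triples=5,
                     home_runs=20, walks=50, hit_by_pitch=5,
                     sacrifice_flies=5)

print(f"BA  {batting_average(hitter):.3f}")
print(f"OBP {on_base_percentage(hitter):.3f}")
print(f"SLG {slugging(hitter):.3f}")
print(f"OPS {ops(hitter):.3f}")
```

Writing the functions side by side shows the units problem directly: `ops` adds a proportion bounded by 1 to a ratio that can reach 4.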

Tango, Lichtman, and Dolphin (2007) have created an alternative to OPS called the on-base average (OBA), computed as a weighted linear combination of various hitting measures. Their intent is to define a measure that gives reasonable, data-based weights to getting on base and hitting for power, a measure that is then scaled so it has values to conform to OBP. Tango, Lichtman, and Dolphin (2007) make a strong case for using OBA instead of OPS. But their method for calculating OBA is complicated (see footnote 1). OBA cannot be easily explained in words, so it fails as a general index of hitting ability.

Simple, comprehensible measures are preferred to complex measures because simple measures are easier to explain to fans, coaches, and managers. Among the most comprehensible measures are simple percentages or proportions computed from the events of a game.

In baseball, much time and effort has been devoted to attempts at finding the best single measure of player prowess. A five-tool player in baseball is a player with strong skills for running, fielding, throwing, hitting, and hitting with power. How can we combine measures of these five traits into a single measure reflecting a player’s contribution to his team?

1 Tango, Lichtman, and Dolphin (2007) provide the formula for weighted on-base average (OBA):

$$\mathrm{wOBA} = \frac{0.72 \times \mathrm{NIBB} + 0.75 \times \mathrm{HBP} + 0.90 \times \mathrm{1B} + 0.92 \times \mathrm{RBOE} + 1.24 \times \mathrm{2B} + 1.56 \times \mathrm{3B} + 1.95 \times \mathrm{HR}}{\mathrm{PA}}$$

where NIBB is the number of non-intentional bases on balls, HBP is the number of times a player is hit by a pitch, 1B is the number of singles, RBOE is the number of times a batter reaches base on an error, 2B is the number of doubles, 3B is the number of triples, and HR is the number of home runs. PA refers to plate appearances, which may or may not exclude bunts, intentional bases on balls, and other events described as “obscure.”
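The weighted on-base average in footnote 1 is a weighted sum of event counts divided by plate appearances. The sketch below uses the weights quoted in the footnote. The function name `woba`, the dictionary layout, and the counts are hypothetical. NIBB is read here as non-intentional bases on balls, which is an assumption about the abbreviation, and the choice of which plate appearances to count is left to the caller because the footnote leaves it open.

```python
# Event weights as quoted in footnote 1.
WOBA_WEIGHTS = {
    "NIBB": 0.72,   # bases on balls (assumed: non-intentional)
    "HBP": 0.75,    # hit by pitch
    "1B": 0.90,     # singles
    "RBOE": 0.92,   # reached base on error
    "2B": 1.24,     # doubles
    "3B": 1.56,     # triples
    "HR": 1.95,     # home runs
}


def woba(counts, plate_appearances):
    """Weighted on-base average from event counts and plate appearances."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    weighted = sum(WOBA_WEIGHTS[event] * counts.get(event, 0)
                   for event in WOBA_WEIGHTS)
    return weighted / plate_appearances


# Hypothetical season counts, for illustration only.
season = {"NIBB": 50, "HBP": 5, "1B": 95, "RBOE": 5,
          "2B": 30, "3B": 5, "HR": 20}

value = woba(season, plate_appearances=560)
print(f"wOBA {value:.3f}")
```

Keeping the weights in one table makes the measure more transparent, though it does not make it any easier to explain in words.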


Comprehensive player evaluation is illustrated by measures of points or wins above replacement. The general idea is to assess player abilities in hitting, fielding, base running, and throwing/pitching relative to a norm group (referred to as replacement players) and then to combine those norm-group-scaled assessments.

We ask, “What is a player’s value to his team? What if he were replaced by another player who is available to play, a player of average ability at the same position?” Wins above replacement is usually expressed in units of wins across the regular season, with ten runs being equivalent to one win. If a player’s wins-above-replacement value is 5, say, then that player’s team can expect to win five fewer games across the entire season if he must be replaced.
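The arithmetic behind a wins-above-replacement figure can be sketched in a few lines. This is an illustration of the conversion described above (ten runs to one win), not any published WAR method. The component names and run values are hypothetical, and real systems differ in how they estimate each component and define the replacement group.

```python
RUNS_PER_WIN = 10.0  # conversion stated in the text


def wins_above_replacement(runs_above_replacement, runs_per_win=RUNS_PER_WIN):
    """Combine component runs above replacement and convert runs to wins."""
    total_runs = sum(runs_above_replacement.values())
    return total_runs / runs_per_win


# Hypothetical runs above a replacement player, by skill component.
player_runs = {
    "hitting": 30.0,
    "fielding": 10.0,
    "base_running": 5.0,
    "throwing": 5.0,
}

war = wins_above_replacement(player_runs)
print(f"wins above replacement: {war:.1f}")
```

With these made-up inputs the player is worth 50 runs, or five wins, over a season.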

Wins-above-replacement measures fail the transparency test when methods of calculation are closely held company secrets. This is a special problem for norm-group-based measures because their meaning rests on the choice of norm group. If we do not know who the replacement players are, then we cannot accurately interpret wins-above-replacement. Furthermore, there is no way of checking the calculations of for-profit companies that refuse to publish their formulas and data. These measures are not in keeping with the spirit of scientific inquiry. They are neither comprehensible nor transparent and should be rejected by fans and teams.

For a transparent wins-above-replacement method, we can use openWAR from Baumer, Jensen, and Matthews (2015). Data and programs for this metric are in the public domain.

What about player performance over time? There are truisms in life, and one of those truisms is that the body ages. Much is understood about age effects in baseball and how to model them (Fair 2008). PECOTA, a well-known measurement and prediction system, uses player-comparable age curves as its base data (see footnote 2). Sadly, PECOTA is another method lacking in transparency.

2 We know about PECOTA from Silver (2004, 2012), who developed the method for Baseball Prospectus in 2002–2003. Silver (2012) explains that the name comes from “a marginal infielder with the Kansas City Royals during the 1980s who was nevertheless a constant thorn in the side of my favorite Detroit Tigers. . . . Although Bill Pecota hit just .249 for his career overall, he hit .303 in games against the Tigers.” (Silver 2012, 88)


Fortunately, there are alternatives to PECOTA. Teams desiring age-based measures can compute them directly, obtaining predictions about performance over the course of a player’s career (Albert and Bennett 2001; Marchi and Albert 2014). Some methods build on Bayesian inference (Albert 2009). Age-based models are most easily developed using tractable measures such as proportions (see footnote 3). Age-based measures and predictions are especially useful in salary negotiations.

How do we go beyond individual player performance to look at a player’s contribution to his team? Therein lies a fundamental question in sports analytics. Anyone who knows sports knows that a good team is worth more than the sum of its parts. And it should come as no surprise that a dysfunctional team is worth less than the sum of its parts. This is to say that team effects should be considered when predicting winners and losers.

Baseball may be less susceptible to team effects than other sports. On a baseball diamond, Tinker-to-Evers-to-Chance works fine, even when Tinker, Evers, and Chance are not speaking with one another. Baseball is distinct from many other team sports in being defined by many individual matchups, one batter facing one pitcher, then another pitcher facing another batter. The events are discrete and easily identifiable as belonging to one player or another.

In most team sports, players complement one another. Some players are described as “team players,” because they help their teammates play better. Stockton and Malone worked together as a unit on the Utah Jazz. Their classic pick-and-roll made the Jazz a difficult team to beat for many years. Their individual performance measures from those years are inextricably intertwined (Oliver 2004).

By their very nature, basketball and many other team sports present special problems in evaluating individual players. A player’s value on one team may be quite different from his value on another team. This is revealed in the free-agency market where bids for a player vary widely across bidding teams. And this explains why player trades can benefit all teams involved.

3 For an age-based measure, we could work with the proportion of hits in at bats (batting average BA), on-base percentage (OBP), or the proportion of home runs in at bats. We think of observations at time $t$ as probabilities $p_t$ and employ the logit transform, using the log of the odds. The resulting model takes the form of a logistic regression with a quadratic term in the linear predictor:

$$\log\left(\frac{p_t}{1 - p_t}\right) = \beta_0 + \beta_1 x_t + \beta_2 x_t^2$$

where $x_t$ is a player’s age at time $t$. With fitted parameters in hand (or posterior distributions for those parameters in a Bayesian context), we work backwards to obtain a player’s age curve. We find the value of the performance measure next year $p_{t+1}$ associated with the player’s age next year $x_{t+1}$.
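The quadratic logit model of footnote 3 can be turned into an age curve by applying the inverse logit to the linear predictor. The sketch below assumes the parameters have already been estimated. The coefficient values are hypothetical, chosen only so the curve peaks near age 27, and the function names are introduced here for illustration.

```python
from math import exp


def age_curve(age, b0, b1, b2):
    """Expected proportion at a given age under the quadratic logit model."""
    eta = b0 + b1 * age + b2 * age ** 2   # linear predictor
    return 1.0 / (1.0 + exp(-eta))        # inverse logit


def peak_age(b1, b2):
    """Age at which the curve reaches its maximum (needs b2 < 0)."""
    if b2 >= 0:
        raise ValueError("no interior maximum unless b2 < 0")
    return -b1 / (2.0 * b2)


# Hypothetical coefficients, not fitted to any real data.
B0, B1, B2 = -3.86, 0.216, -0.004

age_this_year = 31
p_next = age_curve(age_this_year + 1, B0, B1, B2)

print(f"peak age: {peak_age(B1, B2):.1f}")
print(f"expected proportion at age {age_this_year + 1}: {p_next:.3f}")
```

Because the logit is monotone, the proportion peaks at the same age as the linear predictor, which is why `peak_age` needs only the two age coefficients.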

Player evaluation has become a hot topic among the consumers of sport due in large measure to the growth in fantasy sports. Fantasy sports have an individual-player focus, with fantasy teams being little more than the sum of their player component parts. In this regard, fantasy sports are pure fantasy. They do not reflect the way players work together as a team.

Performance measures can be quite useful to coaches, managers, and owners. Measures of athleticism and sport proficiency are well documented in Martin (2016), with summary principles outlined in Martin and Miller (in press).

For references on measurement reliability and validity, we refer to the literature of psychometrics (Gulliksen 1950; Cronbach 1951; Ghiselli 1964; Nunnally 1967; Nunnally and Bernstein 1994; Lord and Novick 1968; Fiske 1971; Brown 1976). Betz and Weiss (2001) and Allen and Yen (2002) introduce concepts of measurement theory. Item response theory is discussed by Rogers, Swaminathan, and Hambleton (1991). Articles in the volume edited by Shrout and Fiske (1995) provide many examples of multitrait-multimethod matrices and review quantitative methods available for the analysis of such matrices. Lumley (2010) discusses sample survey design and analysis in the R programming environment. For R functions in psychometrics, see Revelle (2014). Reviews of the weak measurements versus strong statistics controversy and its relevance (or lack of relevance) to science and statistical inference have been presented by Baker, Hardyck, and Petrinovich (1966) and Velleman and Wilkinson (1993).

Measurement theory applies equally well to text measures and designed surveys. We work with term frequencies within documents and term frequencies adjusted by overall corpus term frequencies. We assign scores to text messages and posts. We utilize methods of natural language processing to detect features in text and to annotate documents (Bird, Klein, and Loper 2009; Pustejovsky and Stubbs 2013). These are measurements.
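Text measures of this kind are easy to compute. The sketch below shows within-document term frequencies and one common corpus-level adjustment, TF-IDF, which down-weights terms that appear in many documents. The text does not say which adjustment is meant, so TF-IDF is an assumption here. The posts are hypothetical, and tokenizing by whitespace is a deliberate simplification.

```python
from collections import Counter
from math import log


def term_frequencies(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


def tf_idf(documents):
    """TF-IDF scores per document, with idf = log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = term_frequencies(tokens)
        scores.append({term: freq * log(n_docs / doc_freq[term])
                       for term, freq in tf.items()})
    return scores


# Hypothetical fan posts, for illustration only.
posts = [
    "great game great win",
    "tough loss tonight",
    "great defense tonight",
]

scores = tf_idf(posts)
for post, score in zip(posts, scores):
    top_term = max(score, key=score.get)
    print(f"{post!r}: highest-scoring term is {top_term!r}")
```

Each score is a number assigned to an attribute of a document according to an explicit rule, which is what makes it a measurement in the sense used above.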


3 Ranking Teams

Gillon: “So you saying that our boxing here in Diggstown is not to your satisfaction, Mister . . . ?”
Caine: “Caine. Gabriel Caine?”
Gillon: “John Gillon. Nice to meet you.”
Caine: “Hey, can I be frank with you?”
Gillon: “Please.”
Caine: “It’s never too satisfying knowing who’s gonna win every time. You know what I mean? Take this mamluke in the white trunks over here. Half way through the first fight, I knew he’ll be kissing canvas. And varoom, he’s already done it twice. So what do you think? Is he going to kiss canvas a third time? Yes.”
Gillon: “What you’re saying is that you think this man in the red trunks is going to win this fight.”
Caine: “Is there like an acoustical problem in here? I didn’t say I think he’s going to win this fight. I said I know he’s going to win this fight. Hey, I gotta split. By the way, I’d bet a thousand on it.”
Gillon: “But would you bet two thousand bucks on it?”
Caine: “What? Are you kidding?”
Gillon: “There’s two things we never joke about in Diggstown, Mr. Caine: our boxing and our betting.”

—BRUCE DERN AS GILLON AND JAMES WOODS AS CAINE IN Diggstown (1992)


The meaning of a measure is defined by its relationships with other measures, and the strength of a team is defined in relation to other teams. When picking winning teams, it is not sufficient to consider individual player statistics. We must see how teams compete with one another.

Measurement is the assignment of numbers to attributes according to rules. And just as we assign numbers to player performance attributes, we can assign numbers to teams. When working with team performance, we usually generate ratings and rankings.

There is debate about how to rank teams, especially when teams have limited opportunity to play one another. Most of us remember the extensive controversies surrounding the Bowl Championship Series (BCS) for college football prior to the introduction of a limited playoff program. And there remains controversy about which teams should qualify for the playoffs due to strength-of-schedule differences across teams.

Compared with college athletics, professional team sports have more balance across league schedules, but perfect balance does not exist. Divisions and conferences are not equal in player abilities, and teams play more of their games with other teams in their own conferences and divisions.
