NFL 2013 Team Expected Points Added (EPA) per game - Defense by Offense
The atavistic love of sport, strategy, violence, speed, size, blood, teamwork and gambling has made American Football irresistible entertainment for millions of folks worldwide. This year's Super Bowl features the Denver Broncos (Bronx) with the best offense - rewriting the rule-book on passing and scoring points, and setting numerous NFL offensive records - versus the Seattle Seahawks (Hawks) with the best defense - a bunch of crazed Comanche warriors perfecting the dark art of assaulting receivers without getting penalized. Is it a coincidence that both teams reside in states (Colorado and Washington) that recently legalized marijuana?
Gambling on professional sports is a billion-dollar business where calculating the odds and using advanced data science techniques appropriately can make the difference between a second home in Aspen and the poor-house. Predicting the outcome of sports games is tricky and sports betting - like financial trading - has bankrupted many smart professional and amateur gamblers. There are many models purporting to predict outcomes based on "secret" formulas. These models calculate probabilities so that you can bet at favorable odds.
For example, the Prediction Machine claims it simulates a specific sports game 50,000 times and is built to consider all "relevant statistical interactions of the players (playing or not playing/injured), coaches, officials, fans (home-field advantage) and weather in each game." My friend Greg Arnold at FootballHack.com has built a model that simulates 10,000 games of "likely" game statistics, then runs those through a "linear regression model that predicts game scores."
However, sporting games, especially team sports, are a high causal density environment - meaning there are many potential causes of winning a game and it is at best difficult - and often impossible - to isolate the "real" cause(s) of the outcome, notwithstanding confident forecasts slinging around terms like "regression models" and "simulations" and "projected score". Another problem in pro football is small sample size, considering that teams only play 16 games in the regular season and a few top teams play from 1 to 4 games in the playoffs.
Yet these models may be useful if carefully and prudently constructed to understand game dynamics and help calculate the odds. There is an element of subjectivity and domain expertise in weighting the relative importance of different factors. For example, in addition to traditional statistics (points per game scored and allowed) the weighting of the following variables matters in a professional football game: turnovers, weather, officiating and injuries. I suggest that each of these factors ought to be assigned a different weight according to the individual factors and team dynamics of each game.
When I ask for detailed information about how models are constructed I am almost always told they are the "secret sauce" and proprietary information. When I ask about the type and quality of data used I usually get vague and contradictory answers. So I have no way to judge the quality of data or the design and quality of the models. This is unacceptable when I am placing money at risk on a bet.
So I decided to build my own Super Bowl predictive model with a more complex set of variables using diverse data sources and high performance computers to virtually simulate many hypothetical games (Full disclosure: author is a Denver Bronco fan). Buy the ticket, take the ride. Why not - after all I am a betting man, a long-time passionate Bronx fan and building this sucker should be fun. And while my heart roots for a Bronx win - my head will bet against the Bronx in a New York second if that will make me money. Here is how I approached the challenge:
Traditional Statistics - Simple Models
During the regular season, the Hawks ranked first in pass defense, total defense and scoring defense while the Bronx were first in pass offense, total offense and scoring offense. The Bronx quarterback (QB) Peyton Manning broke the NFL single-season records for passing yards (5,477) and touchdown passes (55). On the other side, the Bronx were 19th in total defense, while the Hawks were 17th in total offense.
What does this mean for winning the Super Bowl? This is the sixth time (since 1970 AFL-NFL merger) that the best offense has met the best defense. Statistically speaking, the best defense has a record of 4 wins and 1 loss.
Further, 21 of 47 Super Bowls have featured a top five total offense against a top five total defense. There, the defense has beaten the offense 13 times for a 13 to 8 advantage for the great defense. Of the last 12 Super Bowls to feature a top five offense against a top five defense, the top defense is 9-3. Consider that the team with the number one total offense in the league has made the Super Bowl 15 times and is 8-7 while the team with the number one total defense in the league has made the Super Bowl 14 times and is 11-3.
From yet another angle, this Super Bowl has the top passing game versus the top passing defense, and there have been 13 games where a top five passing offense faced a top five passing defense. There, the pass defense is 8-4-1, and the last four games with a top five pass defense versus a top five pass offense have been won by the defense.
Put simply, defense appears to trump offense in the Super Bowl most of the time - all other things being equal. A simple model using traditional NFL statistics predicts a Hawks win. Sometimes simple models work best. Other times more complex models with more data work best. The small sample size of Super Bowl game data is problematic and creates serious doubt about obtaining a strong predictive signal.
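To put a number on the small-sample worry, here is a minimal Python sketch (not part of the model discussed in this article). It takes the defensive records quoted above and asks how often a record at least that good would turn up if every game were a pure coin flip. The coin-flip baseline and the function name prob_at_least are assumptions made for illustration only.

```python
from math import comb

def prob_at_least(wins, games, p=0.5):
    """Exact binomial chance of `wins` or more wins in `games` games,
    if each game were an independent coin flip with win probability p."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

# (wins, losses) for the defense, taken from the records quoted above.
DEFENSE_RECORDS = {
    "best defense vs best offense": (4, 1),
    "top five defense vs top five offense": (13, 8),
    "same matchup, last 12 games": (9, 3),
}

for label, (wins, losses) in DEFENSE_RECORDS.items():
    games = wins + losses
    print(f"{label}: {wins}-{losses} "
          f"(win rate {wins / games:.3f}, "
          f"coin-flip chance of a record this good {prob_at_least(wins, games):.3f})")
```

Under that assumption a 4-1 or 13-8 record shows up by luck alone roughly one time in five, which is why these records are suggestive rather than conclusive.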
The simple model calculates odds that favor the best defense and thus the Hawks should be favored to win. Yet the Bronx are - at this time - favored by anywhere between 1.5 and 2.5 points by most professional bookmakers. Why? The answer, dear reader, is that all other things are rarely equal and professional bookmakers intentionally skew their models to incent higher betting volumes to make more money.
Complex Models with More Data
In addition to using traditional statistics and building a simple model - let's add more data to the mix and take a quick look at various factors - and attempt to assign the proper weight for each variable in a more complex model:
The "turnover ratio" measures the difference between how many times a team turns the ball over (through either an interception or fumble) versus the number of turnovers it creates. This statistic is underrated and may be the most important factor in deciding the outcome of an NFL game. Statistics show that teams with positive turnover ratios have a significantly higher probability of winning. Thus, it should be assigned a heavy weight.
Winning teams usually have a positive turnover differential and are more likely to win the turnover battle.
During the 2013 NFL Season:
The Hawks had 39 takeaways and 19 giveaways for a plus 20 turnover differential.
The Bronx had 26 takeaways and 26 giveaways for a 0 turnover differential.
Over the past five years, teams with more takeaways than giveaways have a combined 810-220-2 (.786) record. During the 2013 NFL season teams with a positive turnover differential have a 72-17 (.809) record.
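The arithmetic behind these figures can be checked with a short Python sketch. The function names are made up for illustration, and counting a tie as half a win is an assumption (it is the usual NFL standings convention and it reproduces the .786 quoted above).

```python
def turnover_differential(takeaways, giveaways):
    """Takeaways minus giveaways; positive means more turnovers created than lost."""
    return takeaways - giveaways

def win_pct(wins, losses, ties=0):
    """Winning percentage with a tie counted as half a win (assumed convention)."""
    return (wins + 0.5 * ties) / (wins + losses + ties)

# (takeaways, giveaways) for the 2013 regular season, as quoted above.
SEASON_2013 = {"Hawks": (39, 19), "Bronx": (26, 26)}

for team, (takeaways, giveaways) in SEASON_2013.items():
    print(f"{team}: {turnover_differential(takeaways, giveaways):+d}")

print(f"Five-year record 810-220-2: {win_pct(810, 220, 2):.3f}")
print(f"2013 record 72-17: {win_pct(72, 17):.3f}")
```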
In last year's Super Bowl, San Francisco and Baltimore both had a turnover ratio of +9. Baltimore beat San Francisco in turnovers (2-1) and won the game.
On the other hand, it is very difficult to predict who will win the turnover war on a game-to-game basis. Hawks get a slight edge here.
Injuries to both key and supporting players are likely to have a significant impact on the outcome and ought to be weighted according to type of injury and quality of player.
Bronx productive tail-back Knowshon Moreno suffered a chest injury in the previous game. Moreno is a high impact player and severity of his injury is key. Hawks star receiver Percy Harvin has been injured all year but may be available for the Super Bowl. A healthy Harvin can be a difference maker in favor of the Hawks. Without Harvin the Hawks are a below average offense.
Recent reports assert both Moreno and Harvin are fit to play. Yet teams often engage in deception and subterfuge to hide the true condition of players so inside information regarding injuries can provide significant advantage. Assume no inside information for the model. No edge for either team.
Weather is an underrated factor and players and teams react differently to various weather conditions. Considering that weather has the potential to have a significant impact on outcome we need to get in the weeds - especially because this is the first cold-weather Super Bowl venue in ages. Which team will be affected most by a bad weather forecast? Can we accurately forecast the weather beyond two days from game to feed into the model? Factors include temperature, wind-chill, snow, wind gusts and rain.
The Hawks shouldn't be too affected by bad weather because they play in a city with much bad weather and they are a team built to win with a ground attack and a great defense. On the other hand, the Bronx play in a sunny (more than 300 days of sunshine in Denver) dry climate with more nice weather than bad and win with a dynamic passing attack.
There is also debate about whether bad conditions disadvantage the offense or the defense more. The answer is it depends on the type of bad weather and whether skill players can adjust. A school of thought - backed by some evidence - suggests that cold or snowy and rainy weather - without high winds - favors the offense because the offensive player knows where he is going and the defensive player has to react. Yet a cold day with high winds favors the defense because the passing game is disrupted.
The evidence suggests cold weather does have a negative effect on Bronx QB Peyton Manning - who has played 15 regular season games in temperatures below 40 degrees. His career 65.5 percent completion percentage drops to 63.7 and he has 14 interceptions in those 15 games. In playoff games played below 40 degrees, Manning is 0-4 with a 56.4 percent completion percentage and a cumulative passer rating of 57.3, compared with an overall rating of 90.1 in playoff games. In those four cold-weather playoff games, he has thrown just four touchdown passes against nine interceptions and has never eclipsed 300 passing yards.
On the other hand, when it's just cold (below 40 degrees), Manning actually beats his QB peers (a passer rating of 84.1 versus 78.9). Consider also that in temperatures in the 40s with low wind, Manning actually has a higher winning percentage (.720) than he does indoors (.691).
The wind may be more critical than the cold or snow or rain. All quarterbacks struggle in high-wind games - the average passer rating for a starting quarterback in these conditions is 75.1. Manning has played in eight games in which the wind was gusting at more than 20 miles an hour. In those games - five of them losses - he sports a passer rating of just 68.0 with five touchdown passes and nine interceptions, compared with a rating of 84.1 in cold-weather games played in calmer conditions.
Yet the wind at MetLife Stadium (Super Bowl venue) may not be a significant factor considering it was designed to reduce the impact of the wind. NFL quarterbacks report that even on a windy day, throwing the ball accurately and with velocity is not hard.
Manning has a 3-7 career record, including playoffs, in games in which the temperature at kickoff was 32 degrees or below. In all other games, he's 170-76. A reasonable hypothesis is that cold and wind will more negatively affect his throws because he is older (37).
Early forecasts suggest snow with a high of 37 degrees with a low of 19. Yet it is difficult to accurately forecast weather beyond two days and thus the assigned weight ought to be low at this time (weather forecasts are more accurate closer to game day). Edge to Hawks.
Officiating can impact the outcome in many ways. For example, the Bronx offense likes to run "rub" plays to get receivers open. The rule book is vague about the difference between a "rub" which is legal and a "pick" which is not legal. This gives the officials discretion and if the officials start flagging the Bronx for illegal "pick" plays - that were mostly deemed legal "rub" plays in the regular season and playoffs - the Bronx will be at a huge disadvantage - so much so that this change in interpretation would likely cost them the game.
On the other side the Hawks defensive secondary likes to get physical with receivers with aggressive use of hands. Officials have discretion about calling "holding" or "pass interference" and if the officials start flagging the Hawks for their aggressive style of defense - discretion has mostly gone the Hawks way this season - this would likely cost them the game.
In this Super Bowl, official Terry McAulay lets players play and doesn't throw many flags. Umpire Carl Paganelli is lenient on calling offensive holding penalties - unless blatant. The back judge is Steve Freeman who lets receivers and defensive backs play so both defenses should be able to get away with aggressive hand-fighting in man-to-man coverage - officials in the playoffs and Super Bowl usually allow extra grabbing by defensive backs. In ten playoff games, only seven defensive pass interferences have been called and this should benefit the Hawks as the most aggressive and best defense in the league. Edge to Hawks.
The Complex Model
Factoring in traditional offensive and defensive data, assigning weights to the above variables, feeding it all into the model and running thousands of simulations produced a Hawks victory 62 percent of the time, by an average score of 24 to 21 (rounded).
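For readers who want to see the general shape of such a simulation, here is a minimal Python sketch. It is not the model described above and will not reproduce its numbers. Every value in it (the base score, the scoring spread, the per-factor edges and weights, and the normal distribution itself) is a hypothetical placeholder chosen for illustration, with only the direction of each edge following the calls made in this article.

```python
import random

# Every number below is a hypothetical placeholder, not an input to the
# model described in the article.
BASE_POINTS = 23.0   # assumed expected score for each team before adjustment
SCORE_SD = 10.0      # assumed game-to-game spread of a team's score, in points

# Edge per factor: +1 fully favors the Hawks, -1 fully favors the Bronx.
EDGES = {"turnovers": 0.5, "injuries": 0.0, "weather": 0.5, "officiating": 0.5}
# Weight per factor: points of margin a full edge is assumed to be worth.
WEIGHTS = {"turnovers": 3.0, "injuries": 2.0, "weather": 1.0, "officiating": 1.0}

def expected_margin(edges, weights):
    """Hawks-minus-Bronx margin implied by the weighted factor edges."""
    return sum(weights[name] * edge for name, edge in edges.items())

def simulate(edges, weights, n_games=10_000, seed=0):
    """Simulate n_games and return (Hawks win share, Hawks avg, Bronx avg)."""
    rng = random.Random(seed)          # fixed seed keeps runs repeatable
    shift = expected_margin(edges, weights) / 2.0
    hawks_wins = 0
    hawks_total = bronx_total = 0.0
    for _ in range(n_games):
        # Draw each score from a normal distribution, floored at zero.
        hawks = max(0.0, rng.gauss(BASE_POINTS + shift, SCORE_SD))
        bronx = max(0.0, rng.gauss(BASE_POINTS - shift, SCORE_SD))
        hawks_wins += hawks > bronx
        hawks_total += hawks
        bronx_total += bronx
    return hawks_wins / n_games, hawks_total / n_games, bronx_total / n_games

share, hawks_avg, bronx_avg = simulate(EDGES, WEIGHTS)
print(f"Hawks win {share:.1%} of simulated games, "
      f"average score {hawks_avg:.0f} to {bronx_avg:.0f}")
```

The point of the sketch is the structure: subjective weights turn into an expected margin, and the simulation only adds noise around it. Change the weights and the answer moves with them.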
But I did not feel right about (1) small sample data size; and (2) the subjective nature of the weighting assigned to various factors. Therefore, I decided to tinker with the various assigned weights to different variables (i.e., turnovers, injuries, weather, officiating) and run more simulations.
Making only a slight adjustment to the turnovers variable changed the outcome to a Bronx victory a majority of the time. Making an adjustment to the injuries variable - where Harvin does not play - changed the outcome to a Bronx victory. An adjustment to the weather variable - where the temperature is under 20 degrees with strong wind gusts - changed the outcome to a Hawks victory.
Another adjustment to the weather variable - where the temperature is over 40 degrees with no wind - changed the outcome to a Bronx victory. Making an adjustment to the officiating variable - where the Bronx are flagged frequently for "rub" plays and thus unable to run crossing patterns - changed the outcome to a Hawks victory. Another adjustment to the officiating variable - where the Hawks are flagged frequently for defensive "holding" and "pass interference" and unable to mug Bronx receivers - changed the outcome to a Bronx victory by 17 points 58% of the time.
The complex model produces different results based on subjectively assigned weights to variables. Turnovers are essentially unpredictable game-to-game. Weather forecasts more than two days out are unreliable. Injuries are unpredictable - especially during the game.
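This kind of sensitivity check can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions and not the model used for this article: the weights, the baseline edges, the scoring spread and the normal approximation are all hypothetical, and the scenario names simply mirror the adjustments described above.

```python
from math import erf, sqrt

# All values are hypothetical placeholders chosen for illustration.
SCORE_SD = 10.0   # assumed spread of each team's score, in points
WEIGHTS = {"turnovers": 4.0, "injuries": 4.0, "weather": 4.0, "officiating": 4.0}
# Edge per factor: +1 fully favors the Hawks, -1 fully favors the Bronx.
BASELINE = {"turnovers": 0.25, "injuries": 0.0, "weather": 0.25, "officiating": 0.25}

def hawks_win_prob(edges, weights=WEIGHTS, sd=SCORE_SD):
    """Chance the Hawks outscore the Bronx under a normal approximation."""
    margin = sum(weights[name] * edge for name, edge in edges.items())
    # The difference of two independent normal scores has spread sd * sqrt(2).
    z = margin / (sd * sqrt(2))
    return 0.5 * (1.0 + erf(z / sqrt(2)))   # standard normal CDF at z

# Each scenario overrides one factor and leaves the rest at baseline.
SCENARIOS = {
    "baseline": {},
    "Bronx win the turnover battle": {"turnovers": -1.0},
    "Harvin does not play": {"injuries": -1.0},
    "under 20 degrees, strong gusts": {"weather": 1.0},
    "over 40 degrees, no wind": {"weather": -1.0},
    "Bronx flagged for picks": {"officiating": 1.0},
    "Hawks flagged for holding": {"officiating": -1.0},
}

def run_scenarios():
    """Return the Hawks win probability for every scenario."""
    return {name: hawks_win_prob({**BASELINE, **override})
            for name, override in SCENARIOS.items()}

for name, prob in run_scenarios().items():
    favorite = "Hawks" if prob > 0.5 else "Bronx"
    print(f"{name}: Hawks win {prob:.1%} -> {favorite}")
```

With these placeholder weights, flipping any single factor is enough to flip the favorite, which is exactly the fragility described above.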
As a result, the accuracy of the different models is in doubt and - while useful - they do not provide a strong enough predictive signal to make a confident bet on this game. It appears this Super Bowl is a toss-up - meaning each team has an equal chance of winning - and thus not a good game on which to place wagers.
My prediction is the Super Bowl outcome will depend on turnovers and weather. Win the turnover battle and you win about 80% of the time. Enjoy!
High Performance Computing (HPC) plus data science allows public and private organizations to get actionable, valuable intelligence from massive volumes of data and use predictive and prescriptive analytics to make better decisions and create game-changing strategies. The integration of computing resources, software, networking, data storage, information management, and data scientists using machine learning and algorithms is the secret sauce to achieving the fundamental goal of creating durable competitive advantage.
HPC has evolved in the past decade to provide "supercomputing" capabilities at significantly lower costs. Modern HPC uses parallel processing techniques for solving complex computational problems. HPC technology focuses on developing parallel processing algorithms and systems by incorporating both administration and parallel computational techniques.
HPC enables data scientists to address challenges that have been unmanageable in the past. HPC expands modeling and simulation capabilities, including using advanced data science techniques like random forests, Monte Carlo simulations, Bayesian probability, regression, naive Bayes, k-nearest neighbors, neural networks, decision trees and others.
Additionally, HPC allows an organization to conduct controlled experiments in a timely manner as well as conduct research for things that are too costly and time consuming to do experimentally. With HPC you can mathematically model and run numerical simulations to attempt to gain understanding via direct observation.
HPC technology today is implemented in multidisciplinary areas including:
• Finance and trading
• Oil and gas industry
• Electronic design automation
• Media and entertainment
• Geographical data
• Climate research
In the near future both public and private organizations in many domains will use HPC plus data science to boost strategic thinking, improve operations and innovate to create better services and products.
Prescriptive analytics automatically synthesizes big data, mathematical sciences, business rules, and machine learning to make predictions and then suggests decision options to take advantage of the predictions.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen.
Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk and illustrate the implication of each decision option. In practice, prescriptive analytics can continually and automatically process new data to improve prediction accuracy and provide better decision options.
Prescriptive analytics synergistically combines data, business rules, and mathematical models. The data inputs to prescriptive analytics may come from multiple sources, internal (inside the organization) and external (social media, et al.). The data may also be structured, which includes numerical and categorical data, as well as unstructured data, such as text, images, audio, and video data, including big data. Business rules define the business process and include constraints, preferences, policies, best practices, and boundaries. Mathematical models are techniques derived from mathematical sciences and related disciplines including applied statistics, machine learning, operations research, and natural language processing.
For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data of external factors such as economic data, population demographic trends and population health trends, to more accurately plan for future capital investments such as new facilities and equipment utilization as well as understand the trade-offs between adding additional beds and expanding an existing facility versus building a new one.
Another example is energy and utilities. Natural gas prices fluctuate dramatically depending upon supply, demand, econometrics, geo-politics, and weather conditions. Gas producers, transmission (pipeline) companies and utility firms have a keen interest in more accurately predicting gas prices so that they can lock in favorable terms while hedging downside risk. Prescriptive analytics can accurately predict prices by modeling internal and external variables simultaneously and also provide decision options and show the impact of each decision option.
See: Predictive, Descriptive, Prescriptive Analytics