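Some background first. I am a Saints fan, a Giants fan, and a Ravens fan (forgive my fickle, unfaithful heart), studying in the politics department at New York University. Last semester I took statistics, and for the final each student had to hand in a statistics paper that tests a hypothesis using data collected on their own. Although I am in a politics department, the paper did not have to analyze political events. Watching my classmates agonize over gathering all sorts of political data, I took a different path and decided to do sports, namely an analysis of NFL data, simply because the data is so easy to find and it is a subject I both enjoy and know well. The instructor is a white American man, a political data analyst by profession who teaches on the side; he loves baseball and knows nothing about football. After we talked, he agreed to let me collect my own data and write an NFL paper, with the warning that if he could not follow it, the consequences were on me.
So I nervously started collecting data.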
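The Saints' championship is too old, and I was not very interested in the Giants' data. For the Ravens, the arrival of Flacco, Rice, and the head coach in 2008 arguably split the team's history into two phases; Wikipedia also defines the period since 2008 as a new era. So I collected the games from the 2008 season onward. In fairness, the Ravens have probably been the most consistent and best-performing team since 2008 (a personal opinion, of course), which makes them easy to write about. There are 93 games in total, neither too many nor too few: enough observations for the statistics to be valid, but not so many as to be a hassle (I was being reckless; many classmates had four or five hundred observations). The dependent variable is the game result, a dummy variable: 1 for a win, 0 for a loss. There are four independent variables, each among the most basic and therefore most important numbers in football: passing yards per game, rushing yards per game, passing yards allowed per game, and rushing yards allowed per game. The software is Stata 12.
Besides the basic OLS regression, I ran five additional tests to check whether the results hold up (this was also the professor's requirement, of course). The Ravens' data cooperated nicely; there was basically nothing getting in the way.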
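On to the results: over the past five years, whether the Ravens won a given game depended mainly on rushing yards (statistically significant at the 99% level), then on passing yards (statistically significant at the 95% level), while the yards allowed per game showed no significant relationship with the result. In other words, as long as the Ravens' offense delivers, it does not matter how badly the defense errs; the offense can make up for it.
The result looks counterintuitive, but it is backed by the math. The model I used is sound, every result went through the five tests, and the paper earned an A in the end; otherwise I would not dare post it here.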
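So over the past five years the Ravens have been winning on the strength of their offense. I wonder what the veteran defenders led by Ray Lewis would make of that. Every team surely runs this kind of analysis every day; that is what data analysts do. Mine has far too few independent variables and cannot compare with the work of analysts on professional teams. In professional sports, data cannot explain everything, and many players contribute in ways the numbers do not capture, but as the data systems become more complete, the picture they give becomes more accurate. That may be why the Ravens felt comfortable giving Flacco such an enormous contract: partly appreciation, partly trust, partly encouragement.
In hindsight, the 2008 draft and coaching change were undoubtedly decisive for the Ravens' success this year (success meaning winner takes all; if a championship is not success, what is?). My personal view: Flacco's big contract marks a complete shift in the Ravens' team-building philosophy from defense to offense. The offense has very few stars, so there is little re-signing pressure for now. Flacco's style, a big arm winding up and heaving it deep (an oversimplification, but you have to abstract in order to summarize), demands speed from receivers above all. So the Ravens' receivers will perhaps be youngsters and rookies from here on, and anyone who makes a name for himself and wants a big contract will be let go. As for the defense, once the few veterans in their prime collect their salaries and gradually leave the stage, it will be time to rebuild.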
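Finally, I am attaching the paper here. If any expert on the forum can point out errors in the statistical model or the validation methods, I would be very grateful; math majors or statistics specialists, please do advise. I mean that sincerely: the grade is already in, and what does a student want if not the result? If you point something out, I can keep revising; statistics is something I may use in my work for the rest of my life. My name and student ID are redacted; everything else is open to everyone at 达阵联盟. I do not plan to publish it, but if you want to use it, please cite it; it is the fruit of my labor, after all. Sorry for rambling on, and thanks for bearing with me.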
The Road to Championship
A Quantitative Study of the Data of Super Bowl XLVII Champion Baltimore Ravens
Wxxx Jxxxxxx
UID: xxxxxxxx
Wilf Family Department of Politics
Graduate School of Arts and Science
New York University
Abstract: The Baltimore Ravens, Super Bowl XLVII champions, were one of the best NFL teams of the past five seasons. Since 2008, the Ravens have reached the NFL playoffs five years in a row and won at least one playoff game each season. They finally won Super Bowl XLVII in 2013 in Louisiana. In this paper, I collect four of the most basic and important statistics (the four independent variables) and examine how they influenced the Ravens' games and what the key to the Ravens' wins was over the past five years.
1. Introduction to the Research Background

The Baltimore Ravens won Super Bowl XLVII on February 3, 2013, in New Orleans, Louisiana. The team from Baltimore was one of the best-performing teams in the NFL over the last five seasons. Since 2008, the Ravens have made the playoffs every year and won at least one playoff game each season. This year, they finally won the Super Bowl. I analyze the Ravens' performance from 2008 onward because in 2008 the Ravens hired a rookie head coach and drafted a rookie quarterback, so the years since 2008 can be seen as a new period in the team's short history. The 2013 Super Bowl win showed that the choices the Ravens' management made in 2008, the year the club began this new era, were right and smart.

There is no secret to how a team wins games. The data on the team and on each individual athlete may tell you everything. With recorded tape of each game, sports analysts can find the reasons each team wins or loses. Overall and detailed analysis helps a team improve its performance.

In this paper, I examine the most basic but important data in a football game to analyze the reasons for the Ravens' good record over the past five years. The first two are offensive data: passing yards in a game and rushing yards in a game. The other two are defensive data: passing yards allowed in a game (lost passing yards) and rushing yards allowed in a game (lost rushing yards).

These four are the most fundamental data in a game. Passing and rushing are the two basic ways of moving the ball and scoring. Thus, passing yards and rushing yards are significant for a team to win a game. On defense, a team's defensive unit should try its best to limit the opponent's passing and rushing yards. A team that cannot stop its opponents is highly likely to lose.

To sum up, in general, the higher the passing and rushing yards, the better the team's chance to win; the lower the passing and rushing yards allowed, the better the team's chance to win. In this paper, I test the relationship between these four variables and the Baltimore Ravens' record over the past five seasons.
2. Data Collection, Description and Measurement

I collected data from Pro Football Reference, a professional football database: Baltimore Ravens Franchise Encyclopedia, http://www.pro-football-reference.com/teams/rav/ [1]. The dependent variable is the win-loss record of the Baltimore Ravens over the past five seasons. I set this dependent variable as a dummy variable: a win has the value 1 and a loss has the value 0. This is a time-series dataset recording all 93 games the Ravens played from 2008 to 2013.

As presented earlier in this paper, the regression model has four independent variables and a chronological variable. All the variables are presented in Table 1 below.
Table 1. Description of the Variables

| Variable | Measurement | Label |
|---|---|---|
| Winning and Losing Records | Game Records | Records |
| Time | Season/Week | Time |
| Passing Yards | Total Passing Yards per Game | Passing |
| Rushing Yards | Total Rushing Yards per Game | Rushing |
| Passing Yards Lost | Total Passing Yards Allowed per Game | Passlosing |
| Rushing Yards Lost | Total Rushing Yards Allowed per Game | Rushlosing |
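As a minimal sketch of the data layout described in Table 1, each game could be stored as one row, with the win/loss dummy derived from the result. The class name `GameRecord` and the sample values are hypothetical and are not taken from the paper's dataset.

```python
from dataclasses import dataclass


@dataclass
class GameRecord:
    """One game, using the variable labels from Table 1."""
    season: int        # Time: season
    week: int          # Time: week within the season
    passing: int       # total passing yards in the game
    rushing: int       # total rushing yards in the game
    passlosing: int    # total passing yards allowed
    rushlosing: int    # total rushing yards allowed
    won: bool          # game result

    @property
    def records(self) -> int:
        """Dummy dependent variable: 1 for a win, 0 for a loss."""
        return 1 if self.won else 0


# Placeholder values for illustration only; not a real game from the dataset.
example = GameRecord(season=2008, week=1, passing=200, rushing=150,
                     passlosing=180, rushlosing=90, won=True)
print(example.records)
```

Keeping the result as a boolean and deriving the 0/1 value in one place avoids coding wins inconsistently across rows.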
3. Regression Models and Tests

3.1 OLS Regression Model

I first propose a linear regression model for the relationship between the four independent variables and the dependent variable:

$$\text{Records} = \alpha + \beta_1\,\text{passing} + \beta_2\,\text{rushing} + \beta_3\,\text{passlosing} + \beta_4\,\text{rushlosing} + \epsilon$$

The results of the regression are listed in Table 2.

Table 2. Results of OLS Regression Model
| Variable | Coefficient | Standard Error | P-value |
|---|---|---|---|
| Passing | .0015903 | .0006045** | 0.010 |
| Rushing | .003225 | .0008157*** | 0.000 |
| Passlosing | -.0007787 | .0005499 | 0.160 |
| Rushlosing | -.0001939 | .0010825 | 0.858 |
| Constant | .1216627 | .2420714 | 0.617 |
| N | 93 | | |
| R² | 0.2262*** | | |
| F-test | 6.43 | | |

** significant at the 95% level; *** significant at the 99% level.
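A quick arithmetic check on Table 2, sketched in Python under the assumption that the F-test row is the usual overall F-statistic for a regression with an intercept. The F-statistic can be recomputed from R², the number of observations, and the number of regressors, and the reported coefficients turn one game's yardage into a fitted value of records. The function names and the sample yardage are illustrative and do not come from the paper.

```python
def overall_f(r2: float, n: int, k: int) -> float:
    """Overall F-statistic from R^2 for n observations and k regressors
    (intercept not counted in k)."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Coefficients as reported in Table 2.
COEF = {
    "const": 0.1216627,
    "passing": 0.0015903,
    "rushing": 0.003225,
    "passlosing": -0.0007787,
    "rushlosing": -0.0001939,
}


def predict_records(passing, rushing, passlosing, rushlosing):
    """Fitted value of the linear model; it is not bounded to [0, 1]."""
    return (COEF["const"]
            + COEF["passing"] * passing
            + COEF["rushing"] * rushing
            + COEF["passlosing"] * passlosing
            + COEF["rushlosing"] * rushlosing)


f_stat = overall_f(r2=0.2262, n=93, k=4)
print(round(f_stat, 2))  # 6.43, matching the F-test row

# Hypothetical yardage, chosen only to show the arithmetic.
fitted = predict_records(passing=250, rushing=150, passlosing=200, rushlosing=100)
print(round(fitted, 3))  # 0.828
```

Because the dependent variable is 0 or 1, the fitted value reads as an estimated win probability, but a linear model can return values below 0 or above 1 for extreme yardage, which is one reason to also fit a logit model.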
In this model, $R^2$ equals 0.2262, which means that about 22.6% of the variance of the variable records can be explained by the model. The F-statistic is 6.43, larger than the critical value of F(4, 93) = 2.4729; thus, the null hypothesis $R^2 = 0$ can be rejected at the 99% confidence level.

The predicted value of records in this model is then:

$$\widehat{\text{records}} = 0.1216627 + 0.0015903\,\text{passing}^{**} + 0.003225\,\text{rushing}^{***} - 0.0007787\,\text{passlosing} - 0.0001939\,\text{rushlosing}$$

The conclusion I draw from this model favors the Ravens' offense, while the defense played little role in deciding the result of a game. First, the Ravens' passing yards are significant at the 95% confidence level, which means that, at that level, the more passing yards the team gains, the more likely it is to win. Second, rushing yards matter even more for the Ravens: they are significant at the 99% confidence level and are the most important indicator in the Ravens' games. Third, passing yards allowed and rushing yards allowed are insignificant, which indicates that the defense has played a much less important role than the offense since 2008. Whatever mistakes the defense makes, as long as the offense performs well, the Ravens can still win.

This model briefly suggests the relationship between the four independent variables and the dependent variable. However, the regression model may still have problems. First, to rule out multicollinearity, that is, correlation among the independent variables, I compute the variance inflation factor. Second, autocorrelation is a common problem for time-series data; to rule it out, I run the Durbin-Watson test.
Third, the variables may be contingent on each other and thus influence the final results, so I test interaction terms for these effects. Fourth, a unit root may affect the trend of the regression; to rule it out, I run a Dickey-Fuller test. Fifth, the dependent variable is a dummy variable, so I also run a logit regression on the dataset to estimate the Ravens' chance of winning over the last five seasons. Last but not least, endogeneity is also a common problem in regression; however, I argue that there is no endogeneity in my regression, for reasons presented later.

3.2 Test of Multicollinearity

To test for multicollinearity among the independent variables, I compute the variance inflation factor (VIF). The formula for $\text{VIF}_i$ of an independent variable is:

$$\text{VIF}_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the $R^2$ obtained by regressing variable $i$ on the other independent variables. The VIFs of the independent variables in the model are listed in Table 3 below.

Table 3. VIFs of the Independent Variables

| Variable | VIF | 1/VIF |
|---|---|---|
| Rushing | 1.13 | 0.885783 |
| Rushlosing | 1.12 | 0.896598 |
| Passlosing | 1.04 | 0.960580 |
| Passing | 1.04 | 0.963291 |
| Mean VIF | 1.08 | |
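The link between VIF and $R_i^2$ can be checked with a few lines of Python. This is a sketch with illustrative function names that uses only the identity $\text{VIF}_i = 1/(1 - R_i^2)$ and the tolerances printed in Table 3; it does not recompute the VIFs from the game data.

```python
def vif_from_r2(r2_i: float) -> float:
    """VIF of a regressor whose auxiliary regression on the other
    regressors has R^2 equal to r2_i."""
    return 1.0 / (1.0 - r2_i)


def r2_from_vif(vif: float) -> float:
    """Invert the identity: share of the regressor's variance explained
    by the other regressors."""
    return 1.0 - 1.0 / vif


# Tolerances (1/VIF) as reported in Table 3.
tolerance = {
    "rushing": 0.885783,
    "rushlosing": 0.896598,
    "passlosing": 0.960580,
    "passing": 0.963291,
}

for name, tol in tolerance.items():
    vif = 1.0 / tol
    print(f"{name}: VIF = {vif:.2f}, implied R_i^2 = {r2_from_vif(vif):.3f}")

print(round(r2_from_vif(5.0), 3))  # a VIF of 5 corresponds to R_i^2 = 0.8
```

The implied $R_i^2$ values are all small, which is another way of saying that none of the four yardage variables is well predicted by the other three.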
Because $\text{VIF}_i$ indicates whether an independent variable can be explained by the other independent variables, the larger the VIF, the more problems may exist in the regression model. Generally, $\text{VIF}_i$ should not be larger than 5, a value that would mean 80% of the variance of the independent variable is explained by the other independent variables. As Table 3 shows, every VIF is close to 1, so there is no multicollinearity in the regression model.

3.3 Test of Autocorrelation

Autocorrelation is a common problem in time-series regression models: the series is correlated with its own past or future values. In a time-series regression, the values can be correlated with each other at a given time lag. Autocorrelation can be either positive or negative.

When the d-statistic is less than 2, any autocorrelation is positive. If $d > d_U$, there is no autocorrelation; if $d < d_L$, there is autocorrelation; if $d_L < d < d_U$, the result is inconclusive.

When the d-statistic is larger than 2, any autocorrelation is negative. If $(4 - d) > d_U$, there is no autocorrelation; if $(4 - d) < d_L$, there is autocorrelation; if $d_L < (4 - d) < d_U$, the result is inconclusive.

The data I collected form a time series from 2008 to 2013, recording the past 93 games the Ravens played.
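The decision rule above can be written as a small Python function. This is a sketch: the function names are illustrative, the bounds in the toy call are made up, and in practice $d_L$ and $d_U$ must be looked up in a Durbin-Watson table for the sample size and number of regressors at hand.

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2), for residuals in time order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def classify(d, d_lower, d_upper):
    """Apply the bounds test; values of d above 2 are mirrored as 4 - d."""
    sign = "positive" if d <= 2 else "negative"
    stat = d if d <= 2 else 4 - d
    if stat > d_upper:
        return "no autocorrelation"
    if stat < d_lower:
        return f"{sign} autocorrelation"
    return "inconclusive"


# Toy residuals that flip sign every period push d well above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
# Hypothetical bounds, used only to exercise the negative branch.
print(classify(3.0, d_lower=1.2, d_upper=1.4))  # negative autocorrelation
```

With the values the paper reports for this model, `classify(1.832428, 1.63024, 1.67615)` returns "no autocorrelation".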
To see whether there is autocorrelation in the model, I use the Durbin-Watson test.

The result of the Durbin-Watson test is: number of gaps in sample: 4; Durbin-Watson d-statistic (2, 88) = 1.832428. Since d is below 2, any autocorrelation in the model would be positive. I then compare the d-statistic with the critical values. The critical values of the Durbin-Watson test at (2, 88) are $d_L = 1.63024$ and $d_U = 1.67615$. Thus $d > d_U$, and there is no autocorrelation in the regression model.

3.4 Test of Interaction Terms

It is reasonable to assume that there are interaction effects in the model and that they may ultimately affect whether the game is won or lost. First, passing yards may be contingent on rushing yards. The offense's time on the field is limited, so the coach has to divide it between the passing game and the running game; the more passing, the less rushing, and vice versa. Second, passing yards allowed may be contingent on rushing yards allowed in a game. Because the defense has to rely on experience to predict whether the opponent will pass or run, it has to focus mainly on one defensive strategy, and each strategy draws the players' attention away from the other. Moreover, the offense and defense share the 60 minutes of game time, so the performance of one unit may limit the time the other has on the field. Thus, I generated 6 interaction terms in total:

passing_rushing = passing × rushing
passing_passlosing = passing × passlosing
passing_rushlosing = passing × rushlosing
rushing_passlosing = rushing × passlosing
rushing_rushlosing = rushing × rushlosing
passlosing_rushlosing = passlosing × rushlosing

The results of the revised model are listed in Table 4.
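The six product terms are simply all pairwise products of the four independent variables, so they can be generated mechanically. A Python sketch, assuming each game is held as a dict keyed by the variable labels; the function name is illustrative and the sample yardage is a placeholder, not real data.

```python
from itertools import combinations

BASE_VARS = ["passing", "rushing", "passlosing", "rushlosing"]


def add_interactions(game: dict) -> dict:
    """Return a copy of one game's row with every pairwise product added,
    named left_right as in the paper (e.g. passing_rushing)."""
    row = dict(game)
    for left, right in combinations(BASE_VARS, 2):
        row[f"{left}_{right}"] = game[left] * game[right]
    return row


# Placeholder yardage for illustration only.
game = {"passing": 200, "rushing": 100, "passlosing": 150, "rushlosing": 80}
row = add_interactions(game)

interaction_names = [name for name in row if name not in BASE_VARS]
print(len(interaction_names))   # 6
print(row["passing_rushing"])   # 20000
```

Generating the names from the variable list keeps the terms in the same order as the paper's list and avoids typing each product by hand.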
Table 4. Results of Variables in Revised Model

| Variable | Coefficient | Standard Error | P-value |
|---|---|---|---|
| Passing | .0038997 | .0039558 | 0.327 |
| Rushing | .003998 | .0040113 | 0.322 |
| Passlosing | -.0017751 | .0031285 | 0.572 |
| Rushlosing | .0026028 | .0054157 | 0.632 |
| Passing_rushing | -.0000114 | .0000134 | 0.399 |
| Passing_passlosing | 7.96e-07 | 8.71e-06 | 0.927 |
| Passing_rushlosing | -9.42e-06 | .0000185 | 0.612 |
| Rushing_passlosing | 8.02e-06 | .0000122 | 0.514 |
| Rushing_rushlosing | -2.85e-06 | .000023 | 0.902 |
| Passlosing_rushlosing | -2.53e-06 | .0000166 | 0.879 |
| Constant | -.1889298 | .9010632 | 0.834 |
| N | 93 | | |
| R² | 0.2420 | | |
| F-test | 2.62 | | |
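Whether the six interaction terms add explanatory power can be judged from the two R² values in Tables 2 and 4 alone. The Python sketch below computes the incremental F-statistic for nested models, which is the calculation the paper carries out under the name Chow test; the function and argument names are illustrative.

```python
def incremental_f(r2_restricted, r2_full, n, k_full, q):
    """F-statistic for adding q regressors to a nested model.

    n      -- number of observations
    k_full -- regressors in the full model, intercept not counted
    q      -- regressors added relative to the restricted model
    """
    numerator = (r2_full - r2_restricted) / q
    denominator = (1.0 - r2_full) / (n - k_full - 1)
    return numerator / denominator


# R^2 values from Table 2 (4 regressors) and Table 4 (4 + 6 regressors).
f_stat = incremental_f(r2_restricted=0.2262, r2_full=0.2420,
                       n=93, k_full=10, q=6)
print(round(f_stat, 4))  # 0.2849

critical_5pct = 2.2113   # F(6, 82) critical value quoted in the paper
print(f_stat > critical_5pct)  # False: the added terms are not significant
```

The denominator degrees of freedom, 93 - 10 - 1 = 82, agree with the 93 - 5 - 6 used in the paper.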
In the revised model, all the independent variables are insignificant. We may take these results to reflect that the different players, units, and groups on the Ravens are independent and not easily influenced by one another's performance and emotions. Everyone just focuses on his own job.

To test whether the revised model differs significantly from the original model, I run a Chow test on the second model. The $R^2$ for the first regression model was 0.2262, and the $R^2$ for the revised regression model is 0.2420. There were 5 variables (4 independent and 1 dependent) in the first regression and 6 interaction terms in the revised model. The number of observations is 93.

$$F_{6,\,93-5-6} = \frac{(0.2420 - 0.2262)/6}{(1 - 0.2420)/(93 - 5 - 6)} = \frac{0.0158/6}{0.758/82} = \frac{0.002633}{0.009244} = 0.284872$$

The critical F-value for an alpha of 0.05 with 6 and 82 degrees of freedom is 2.2113. The value of 0.284872 is smaller than the critical value, so I cannot reject the null hypothesis. The interaction effects are statistically insignificant.

3.5 Test of Unit Root

A unit root is a common problem in time-series data. Checking whether there is a unit root in the Ravens' records means checking whether the series follows a random walk. I run the augmented Dickey-Fuller test to see whether the dependent variable is stationary or non-stationary. The result of the Dickey-Fuller test is presented below.

Table 5. Results of Dickey-Fuller, Trend Regression Test (N of obs = 88)
| Statistic | Value |
|---|---|
| Test statistic Z(t) | -10.440 |
| 1% critical value | -3.527 |
| 5% critical value | -2.900 |
| 10% critical value | -2.585 |
| MacKinnon approximate p-value for Z(t) | 0.0000 |
| L1. coefficient (standard error) | -1.15857*** (.111191) |
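For reference, a Dickey-Fuller statistic of the kind shown in Table 5 comes from regressing the first difference of the series on its lagged level. Below is a minimal Python sketch of that regression without extra lags, run on simulated win/loss outcomes rather than the Ravens' actual record; the function name and the simulated series are assumptions made for illustration.

```python
import math
import random


def dickey_fuller(y):
    """Regress dy_t = a + b * y_{t-1} + e_t by OLS.

    Returns (b, se_b, t) where t = b / se_b is the Dickey-Fuller statistic.
    """
    lagged = y[:-1]
    diff = [y[t] - y[t - 1] for t in range(1, len(y))]
    n = len(diff)
    mean_x = sum(lagged) / n
    mean_d = sum(diff) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(lagged, diff))
    b = sxy / sxx
    a = mean_d - b * mean_x
    ssr = sum((d - a - b * x) ** 2 for x, d in zip(lagged, diff))
    se_b = math.sqrt(ssr / (n - 2) / sxx)
    return b, se_b, b / se_b


# Simulated, independent win/loss outcomes (not the real game results).
rng = random.Random(0)
series = [1 if rng.random() < 0.6 else 0 for _ in range(93)]

b, se_b, t_stat = dickey_fuller(series)
print(f"b = {b:.3f}, se = {se_b:.3f}, t = {t_stat:.2f}")
# The statistic is compared with Dickey-Fuller critical values, not with
# the ordinary t distribution.
```

A coefficient near -1 on the lagged level, as in Table 5, is what a series with little memory of its previous value produces; a unit root would instead give a coefficient near 0.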
The estimated coefficient on the lagged level is β ≈ -1.159 < 0, so ρ = 1 + β ≈ -0.159. The test statistic, Z(t) = -10.440, is below the 1% critical value of -3.527. Thus, we may reject the null hypothesis, and there is no unit root in the dependent variable.

3.6 Logit Regression

A dummy dependent variable is well suited to logit regression, because the logit model expresses the probability of one of the two values, here the probability that the Ravens win the game.

I then run the logit regression of the dependent variable on the original four independent variables.

Table 6. Results of Logit Regression Model
| Variable | Coefficient | Standard Error | P-value |
|---|---|---|---|
| Passing | .0097464 | .0040798** | 0.017 |
| Rushing | .0213477 | .0061878*** | 0.001 |
| Passlosing | -.004933 | .0032167 | 0.125 |
| Rushlosing | -.0011336 | .0060437 | 0.851 |
| Constant | -2.538214 | 1.428785 | 0.076 |
| N | 93 | | |
| Pseudo R² | 0.2158*** | | |
| LR chi2(4) | 25.24 | | |

** significant at the 95% level; *** significant at the 99% level.
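Logit coefficients are on the log-odds scale, so reading Table 6 takes one more step than reading Table 2. The Python sketch below converts the reported coefficients into a win probability for one game and into an odds ratio for extra rushing yards; the yardage is hypothetical and the function name is illustrative.

```python
import math

# Coefficients as reported in Table 6.
LOGIT_COEF = {
    "const": -2.538214,
    "passing": 0.0097464,
    "rushing": 0.0213477,
    "passlosing": -0.004933,
    "rushlosing": -0.0011336,
}


def win_probability(passing, rushing, passlosing, rushlosing):
    """Logistic transform of the linear index x'b."""
    index = (LOGIT_COEF["const"]
             + LOGIT_COEF["passing"] * passing
             + LOGIT_COEF["rushing"] * rushing
             + LOGIT_COEF["passlosing"] * passlosing
             + LOGIT_COEF["rushlosing"] * rushlosing)
    return 1.0 / (1.0 + math.exp(-index))


# Hypothetical yardage, chosen only to show the arithmetic.
p = win_probability(passing=250, rushing=150, passlosing=200, rushlosing=100)
print(round(p, 2))  # 0.88

# Multiplicative change in the odds of winning per 10 extra rushing yards.
odds_ratio_10 = math.exp(10 * LOGIT_COEF["rushing"])
print(round(odds_ratio_10, 2))  # 1.24
```

Unlike the fitted value of the linear model, this probability always lies between 0 and 1.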
As Table 6 shows, passing yards are still significant at the 95% level and rushing yards are significant at the 99% level. As in the OLS model, passing yards allowed and rushing yards allowed are still insignificant.

The matching results in the logit model show that since 2008 the Ravens have relied mainly on their offense to win games, which is counterintuitive for viewers and fans of the Ravens' All-Star defense.

3.7 Argument on the Test of Endogeneity
Testing for endogeneity is important in statistical work. Endogeneity means that the dependent variable may also cause variation and change in the independent variables.

However, this is highly unlikely in this regression model. I argue that the dependent variable, the win-loss record, is entirely determined by the four independent variables, which reflect the athletes' performance and the coaching staff's strategies. The result of a game is not determined until its last second; that is to say, the dependent variable has no way to influence the performance of players and coaches. Thus, we may reasonably argue that there is no endogeneity in the regression model.

4. Conclusion

Statistical methods are important in the analysis of politics, economics, and other natural and social sciences. The use of statistics in sports science and sports analysis is not new. In this paper, I collected time-series data and applied six statistical methods to analyze what exactly helped the Baltimore Ravens outperform other teams over the past five seasons and finally win the Super Bowl. The result of the research is satisfactory. The team relied mainly on its offense to win games. The Ravens' weakness is its defense. The team should build up its defense in the future and maintain the level of performance of the past five seasons.
[1] Pro Football Reference. Baltimore Ravens Franchise Encyclopedia. http://www.pro-football-reference.com/teams/rav/