A statistics term paper analyzing the Ravens' data from the past 5 years, which may help explain Flacco's contract

惊涛拍岸 · posted 2013-5-18 13:33

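Some background first. I'm a fan of the Saints, the Giants and the Ravens (forgive my divided loyalties). I study in the Department of Politics at New York University. Last semester I took statistics, and for the final everyone had to hand in a statistics paper that used self-collected data to test a hypothesis of their own. Although I'm in the politics department, the paper did not have to analyze political events. Watching my classmates struggle to collect all kinds of political data, I decided to take a different route and do sports, namely an analysis of NFL data, mainly because the data is so easy to find and it is something I both enjoy and know well. The instructor is a white American man, a professional political data analyst who teaches on the side; he loves baseball and knows nothing about football. After we talked he agreed to let me collect my own data and write an NFL paper, on the understanding that if he could not follow it, the consequences were on me.
So I nervously started collecting data.
The Saints' championship was too long ago, and I wasn't that interested in the Giants' numbers. The Ravens, since Flacco, Rice and the head coach arrived in 2008, can fairly be treated as being in a different phase from what came before; Wikipedia also defines the period since 2008 as a new era. So I collected the games from the 2008 season onward. In fairness, I think the Ravens have been the most stable and best-performing team since the 2008 season (my personal opinion, of course), which makes them easy to write about. There are 93 games in total, a comfortable size: enough observations for the statistics to be valid, but not so many as to be a chore (I was being a bit reckless; many classmates had four or five hundred observations). The dependent variable is the game result, a dummy variable: 1 for a win, 0 for a loss. There are four independent variables, each among the most basic and therefore most important numbers in football: passing yards, rushing yards, passing yards allowed and rushing yards allowed in each game. The software is Stata 12.
Besides the basic OLS regression, I ran five further tests to check whether the results hold up (this was also the professor's requirement). The Ravens data behaved well: essentially nothing interfered with the results.
The results: over the past 5 years, whether the Ravens won a given game depended mainly on rushing yards (statistically significant at the 99% level) and secondarily on passing yards (statistically significant at the 95% level), while the yards allowed in a game showed no significant relationship with the result. In other words, as long as the Ravens offense delivers, it can make up for whatever mistakes the defense makes.
The result looks counterintuitive, but the math backs it: the model itself is sound and every result went through the five tests. The paper ended up with an A, otherwise I would not dare post it here.
Over the past 5 years the Ravens have been winning on the strength of their offense. I wonder what the Ravens' defensive veterans, led by Ray Lewis, would make of that. Every team surely runs this kind of analysis every day; that is what their data analysts do. Mine has far too few independent variables and cannot compare with what the professional analysts at a team produce. In professional sports the data cannot explain everything, and many players' contributions do not show up in the numbers, but as data systems become more complete, the facts they reflect become more accurate. This may be why the Ravens were comfortable giving Flacco such a sky-high contract: partly appreciation, partly trust, partly encouragement.
In hindsight, the 2008 draft and coaching change were clearly decisive for the Ravens' success this year (by success I mean that the winner takes all: if a championship does not count as success, what does?). My own view: Flacco taking the Ravens' big contract marks the team's shift in philosophy from building around the defense to building around the offense. The offense has very few stars, so re-signing pressure looks low for now. Flacco's style of winding up his big arm and throwing deep (an oversimplification, but you have to abstract to summarize) demands speed from his receivers above all, so in the future the Ravens may rely on young players and rookies at wide receiver and let anyone go who makes a name for himself and wants a big contract. As for the defense, once the veterans still in their prime have collected their pay and gradually left the stage, it will be time to rebuild.
Finally, the paper is attached below. If anyone on the forum can point out errors in the statistical model or in the tests, I would be very grateful; math majors and especially statistics majors, please do advise. I mean it: the grade is already in, and as a student, isn't the result what you are after? If you point out problems I can keep revising, because statistics may be something I use for the rest of my working life. My name and student ID are redacted; everything else is open to everyone at 达阵联盟. I have no plans to publish it, but if you want to use it, please cite it, since it is my own work. Sorry for rambling on, and thanks for bearing with me~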

The Road to Championship
A Quantitative Study of the Data of the Super Bowl XLVII Champion Baltimore Ravens
Wxxx Jxxxxxx
UID: xxxxxxxx
Wilf Family Department of Politics, Graduate School of Arts and Science, New York University

Abstract: The Super Bowl XLVII champion Baltimore Ravens were one of the best NFL teams of the past five seasons. Since 2008 the Ravens have reached the NFL playoffs in five consecutive years and won at least one playoff game each season, and in 2013 they won Super Bowl XLVII in Louisiana. In this paper I collect the four most basic and important statistics (four independent variables) and examine how they influenced the Ravens' games and which of them were key to the Ravens' wins over the past five years.
1. Introduction to the Research Background
The Baltimore Ravens won Super Bowl XLVII on February 3rd, 2013 in New Orleans, Louisiana. The team from Baltimore was one of the best-performing teams in the NFL over the last five seasons. Since 2008 the Ravens have reached the playoffs every year and won at least one playoff game each season. This year they finally won the Super Bowl championship. I analyze the Ravens' performance from 2008 onward because in 2008 the Ravens hired a rookie head coach and drafted a rookie quarterback, so the years since 2008 can be seen as a new period in the short history of the team. The Super Bowl win in 2013 showed that the choices the Ravens' management made in 2008, the year they started a new era for the club, were right and smart.
There is no secret to how a team wins games. The data on the team and on every individual athlete may tell you everything. With game tape, sports analysts can find the reasons each team wins or loses. Overall and detailed analysis helps a team improve its performance.
In this paper I study the most basic but important data in a football game to analyze the reasons for the Ravens' good record over the past five years. The first two variables are offensive: the passing yards and the rushing yards in a game. The other two are defensive: the passing yards allowed in a game (lost passing yards) and the rushing yards allowed in a game (lost rushing yards).
These four statistics are the most fundamental data in a game. Passing and rushing are the two basic ways of moving the ball and scoring, so passing yards and rushing yards matter for winning a game. On defense, a team's defensive unit should try its best to limit the passing and rushing yards of its opponent. A team that cannot stop its opponents is highly likely to lose.
To sum up, in general, the higher the passing and rushing yards, the better the team's chance to win; the lower the passing and rushing yards allowed, the better the team's chance to win. In this paper I test the relations between these four variables and the Baltimore Ravens' record over the past five years.
2. Data Collection, Description and Measurement
I collected the data from Pro Football Reference, a professional football database: Baltimore Ravens Franchise Encyclopedia, http://www.pro-football-reference.com/teams/rav/ [1]. The dependent variable is the win/loss record of the Baltimore Ravens over the past five seasons. I code it as a dummy variable: a win takes the value 1 and a loss takes the value 0. The data set is a time series covering all 93 games the Ravens played from 2008 to 2013.
As presented earlier, the regression model has four independent variables, plus a chronological variable. All the variables are described in Table 1.
Table 1. Description of the Variables

Variable	Measurement	Label
Winning and Losing Records	Game Records	Records
Time	Season/Week	Time
Passing Yards	Total Passing Yards per Game	Passing
Rushing Yards	Total Rushing Yards per Game	Rushing
Passing Yards Lost	Total Passing Yards Allowed per Game	Passlosing
Rushing Yards Lost	Total Rushing Yards Allowed per Game	Rushlosing

3. Regression Models and Tests
3.1 OLS Regression Model
I first propose a linear regression model for the relation between the four independent variables and the dependent variable:
$$\text{Records} = \alpha + \beta_1\,\text{passing} + \beta_2\,\text{rushing} + \beta_3\,\text{passlosing} + \beta_4\,\text{rushlosing} + \epsilon$$
The results of the regression are listed in Table 2.
Table 2. Results of OLS Regression Model

Variable	Coefficient	Standard Error	P-value
Passing	.0015903	.0006045**	0.010
Rushing	.003225	.0008157***	0.000
Passlosing	-.0007787	.0005499	0.160
Rushlosing	-.0001939	.0010825	0.858
Constant	.1216627	.2420714	0.617
N	93
R²	0.2262***
F-test	6.43
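As a sanity check on Table 2, the t-statistics and the overall F-statistic can be recomputed from the reported numbers alone. The sketch below is in Python rather than the Stata used for the paper; the dictionary and function names are illustrative, and the F formula is the standard one for testing all slopes jointly in a model with an intercept.

```python
# Reported OLS estimates from Table 2: (coefficient, standard error).
TABLE2 = {
    "passing":    (0.0015903, 0.0006045),
    "rushing":    (0.003225, 0.0008157),
    "passlosing": (-0.0007787, 0.0005499),
    "rushlosing": (-0.0001939, 0.0010825),
}
N_GAMES = 93
R_SQUARED = 0.2262


def t_statistic(coefficient, standard_error):
    """t = coefficient / standard error."""
    return coefficient / standard_error


def overall_f(r_squared, n_obs, n_slopes):
    """F-statistic for H0: all slopes are zero, in a model with an intercept."""
    residual_df = n_obs - n_slopes - 1
    return (r_squared / n_slopes) / ((1.0 - r_squared) / residual_df)


t_stats = {name: t_statistic(b, se) for name, (b, se) in TABLE2.items()}
f_stat = overall_f(R_SQUARED, N_GAMES, len(TABLE2))

for name, t in t_stats.items():
    print(f"{name:<11} t = {t:6.2f}")
print(f"F(4, {N_GAMES - len(TABLE2) - 1}) = {f_stat:.2f}")
```

The recomputed F agrees with the 6.43 in the table, and the residual degrees of freedom come out as 93 - 4 - 1 = 88.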

In this model R² equals 0.2262, which means that about 22.6% of the variance of the variable records is explained by the model. The F-statistic is 6.43 with 4 and 88 degrees of freedom, well above the 5% critical value of about 2.47, so the null hypothesis R² = 0 may be rejected; Table 2 marks the result as significant at the 99% level.
The predicted value of records in this model is then:
$$\widehat{\text{records}} = 0.1216627 + 0.0015903\,\text{passing}^{**} + 0.003225\,\text{rushing}^{***} - 0.0007787\,\text{passlosing} - 0.0001939\,\text{rushlosing}$$
The conclusion I draw from this model favors the Ravens' offense, while the defense appears to play little role in deciding the result of a game. First, the Ravens' passing yards are significant at the 95% confidence level: the more passing yards, the more likely the team is to win. Second, rushing yards matter even more: they are significant at the 99% confidence level and are the most important indicator in the Ravens' games. Third, passing yards allowed and rushing yards allowed are insignificant, which indicates that the defense has played a much less important role than the offense since 2008. Whatever mistakes the defense makes, as long as the offense performs well, the Ravens can still win.
This model gives a first picture of the relations between the four independent variables and the dependent variable. However, the regression may still suffer from several problems. First, to check for multicollinearity, that is, correlation among the independent variables, I compute the variance inflation factor. Second, autocorrelation is a common problem in time-series data; to check for it, I run the Durbin-Watson test.
Third, the variables may be contingent on each other and thus influence the final results, so I test interaction terms. Fourth, a unit root may affect the trend of the regression; to check for one, I run a Dickey-Fuller test. Fifth, the dependent variable is a dummy variable, so I also run a logit regression to estimate the Ravens' chance of winning over the last five seasons. Last but not least, endogeneity is a common problem in regression; I argue that there is no endogeneity in my regression, and my reasons are presented later.
3.2 Test of Multicollinearity
To test for multicollinearity among the independent variables, I compute the variance inflation factor (VIF). The VIF of independent variable i is
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the R² obtained by regressing variable i on the other independent variables. The VIFs of the independent variables in the model are listed in Table 3.
Table 3. VIFs of the Independent Variables

Variable	VIF	1/VIF
Rushing	1.13	0.885783
Rushlosing	1.12	0.896598
Passlosing	1.04	0.960580
Passing	1.04	0.963291
Mean VIF	1.08
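The VIF formula can be read in both directions: the 1/VIF column of Table 3 is 1 - R_i², so it gives the share of each variable explained by the other three. The following Python sketch (helper names are illustrative, not part of the paper) recovers those shares and the mean VIF from the reported column.

```python
# 1/VIF values as reported in Table 3.
TOLERANCE = {
    "rushing":    0.885783,
    "rushlosing": 0.896598,
    "passlosing": 0.960580,
    "passing":    0.963291,
}


def vif_from_r2(r_squared):
    """VIF_i = 1 / (1 - R_i^2)."""
    return 1.0 / (1.0 - r_squared)


def r2_from_tolerance(tolerance):
    """Since 1/VIF_i = 1 - R_i^2, the auxiliary R_i^2 is 1 - 1/VIF_i."""
    return 1.0 - tolerance


vifs = {}
for name, tol in TOLERANCE.items():
    r2 = r2_from_tolerance(tol)
    vifs[name] = vif_from_r2(r2)
    print(f"{name:<11} R_i^2 = {r2:.3f}  VIF = {vifs[name]:.2f}")

mean_vif = sum(vifs.values()) / len(vifs)
print(f"mean VIF = {mean_vif:.2f}")

# A VIF of 5 corresponds to an auxiliary R_i^2 of 0.8.
print(f"VIF at R_i^2 = 0.8: {vif_from_r2(0.8):.1f}")
```

Here the largest auxiliary R_i² is about 0.11 (rushing), far from the 0.8 that corresponds to a VIF of 5.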

Because $\mathrm{VIF}_i$ indicates how much of an independent variable can be explained by the other independent variables, the larger the VIF, the more problems may exist in the regression model. Generally, $\mathrm{VIF}_i$ should not be larger than 5, the value at which 80% of the variance of the independent variable is explained by the other independent variables.
As Table 3 shows, all VIFs are close to 1, so there is no multicollinearity problem in the regression model.
3.3 Test of Autocorrelation
Autocorrelation is a common problem in time-series regression: the series is correlated with its own past or future values at a given time lag. Autocorrelation can be either positive or negative.
When the d-statistic is less than 2, any autocorrelation is positive. If $d > d_U$, there is no autocorrelation; if $d < d_L$, there is autocorrelation; if $d_L < d < d_U$, the result is inconclusive.
When the d-statistic is larger than 2, any autocorrelation is negative. If $(4-d) > d_U$, there is no autocorrelation; if $(4-d) < d_L$, there is autocorrelation; if $d_L < (4-d) < d_U$, the result is inconclusive.
The data I collected form a time series from 2008 to 2013, recording the 93 games the Ravens played.
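The bounds rule above can be sketched in a few lines of Python. This is a minimal illustration, not the Stata routine used for the paper: `durbin_watson` implements the textbook definition of d on a list of residuals, and `dw_decision` applies the rule as stated. The example call uses the d-statistic and bounds reported for this model in Section 3.3 (d = 1.832428, dL = 1.63024, dU = 1.67615).

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2), residuals in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_decision(d, d_lower, d_upper):
    """Apply the Durbin-Watson bounds rule to a d-statistic."""
    # For d above 2 the test is mirrored: compare 4 - d against the same bounds.
    direction = "positive" if d < 2 else "negative"
    value = d if d < 2 else 4 - d
    if value > d_upper:
        return "no autocorrelation"
    if value < d_lower:
        return f"{direction} autocorrelation"
    return "inconclusive"


# Values reported for the Ravens model.
print(dw_decision(1.832428, 1.63024, 1.67615))

# A made-up alternating residual series gives d = 3, the negative case.
alternating_d = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(alternating_d, dw_decision(alternating_d, 1.63024, 1.67615))
```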
To see whether there is autocorrelation in the model, I use the Durbin-Watson test. The result is: number of gaps in sample: 4; Durbin-Watson d-statistic (2, 88) = 1.832428. Because d is below 2, any autocorrelation would be positive. I then compare the d-statistic with the critical values of the Durbin-Watson test at (2, 88): $d_L = 1.63024$, $d_U = 1.67615$. Thus $d > d_U$, and there is no autocorrelation in the regression model.
3.4 Test of Interaction Terms
It is reasonable to assume that there are interaction effects in the model and that they may affect whether a game is won or lost. First, passing yards may be contingent on rushing yards. The time available to a team's offense is limited, so the coach has to divide it between the passing game and the rushing game: the more passing, the less rushing, and vice versa. Second, passing yards allowed may be contingent on rushing yards allowed in a game. Since the defense has to rely on experience to predict whether the opponent will pass or rush, it has to put its main focus on one defensive strategy, and each strategy draws the players' attention away from the other. Moreover, the offense and the defense share the 60 minutes of a game, so the performance of one unit may limit the time the other unit has on the field. In total I therefore generated 6 interaction terms:
passing_rushing = passing × rushing
passing_passlosing = passing × passlosing
passing_rushlosing = passing × rushlosing
rushing_passlosing = rushing × passlosing
rushing_rushlosing = rushing × rushlosing
passlosing_rushlosing = passlosing × rushlosing
The results of the revised model are listed in Table 4.

Table 4. Results of Variables in Revised Model

Variables	Coefficient	Standard Errors	P-value
Passing	.0038997	.0039558	0.327
Rushing	.003998	.0040113	0.322
Passlosing	-.0017751	.0031285	0.572
Rushlosing	.0026028	.0054157	0.632
Passing_rushing	-.0000114	.0000134	0.399
Passing_passlosing	7.96e-07	8.71e-06	0.927
Passing_rushlosing	-9.42e-06	.0000185	0.612
Rushing_passlosing	8.02e-06	.0000122	0.514
Rushing_rushlosing	-2.85e-06	.000023	0.902
Passlosing_rushlosing	-2.53e-06	.0000166	0.879
Constant	-.1889298	.9010632	0.834
N	93
R²	0.2420
F-test	2.62
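Whether the six added interaction terms improve on the original model can be checked with an F-test on the change in R². The Python sketch below assumes the usual nested-model formula and takes its inputs from the two result tables (R² = 0.2262 for the original model, R² = 0.2420 for the revised one); the function name and argument names are illustrative, and 2.2113 is the 5% critical value quoted in the paper.

```python
def nested_f(r2_restricted, r2_full, n_obs, n_params_restricted, n_added):
    """F-statistic for H0: the added terms all have zero coefficients.

    n_params_restricted counts the restricted model's slopes plus its constant.
    """
    residual_df = n_obs - n_params_restricted - n_added
    numerator = (r2_full - r2_restricted) / n_added
    denominator = (1.0 - r2_full) / residual_df
    return numerator / denominator, residual_df


# 4 slopes + 1 constant in the original model, 6 interaction terms added.
f_value, residual_df = nested_f(0.2262, 0.2420, n_obs=93,
                                n_params_restricted=5, n_added=6)
CRITICAL_F_5PCT = 2.2113  # value quoted in the paper for (6, 82) df

print(f"F(6, {residual_df}) = {f_value:.4f}")
print("reject H0" if f_value > CRITICAL_F_5PCT else "cannot reject H0")
```

With these inputs the statistic falls well below the critical value, so the interaction terms add no significant explanatory power.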

In the revised model, none of the independent variables is significant. One possible reading is that the different players, units and groups on the Ravens are independent and are not easily influenced by others' performances and emotions: everyone just focuses on his own job.
To test whether the revised model differs significantly from the original model, I run a Chow test on the second model. The R² of the first regression model was 0.2262 and the R² of the revised model is 0.2420. The first regression estimated 5 parameters (4 slopes and the constant), and the revised model adds 6 interaction terms. The number of observations is 93.
$$F_{6,\,93-5-6} = \frac{(0.2420 - 0.2262)/6}{(1 - 0.2420)/(93-5-6)} = \frac{0.0158/6}{0.758/82} = \frac{0.002633}{0.009244} = 0.284872$$
The critical F-value for alpha = 0.05 with 6 and 82 degrees of freedom is 2.2113. The value 0.284872 is smaller than the critical value, so I cannot reject the null hypothesis: the interaction effects are statistically insignificant.
3.5 Test of Unit Root
A unit root is a common problem in time-series data. Checking whether the Ravens' record has a unit root amounts to checking whether it follows a random walk. I run the augmented Dickey-Fuller test to see whether the dependent variable is stationary or non-stationary. The result of the Dickey-Fuller test is presented below.
Table 5. Results of Dickey-Fuller, Trend Regression Test

N of obs = 88	Z(t)
Test Statistic	-10.440
1% Critical Value	-3.527
5% Critical Value	-2.900
10% Critical Value	-2.585
MacKinnon approximate p-value for Z(t)	0.0000
L1.	-1.15857***	.111191
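Reading Table 5 comes down to comparing the test statistic with the critical values. A small Python sketch of that comparison follows; the function name is illustrative, and the numbers are the ones in the table. The Dickey-Fuller test is left-tailed, so the unit-root null is rejected when Z(t) is more negative than the critical value.

```python
# Critical values from Table 5.
CRITICAL_VALUES = {"1%": -3.527, "5%": -2.900, "10%": -2.585}


def strictest_rejection_level(test_statistic, critical_values):
    """Return the strictest level at which the unit-root null is rejected.

    Returns None when the statistic is not below any critical value.
    """
    for level in ("1%", "5%", "10%"):
        if test_statistic < critical_values[level]:
            return level
    return None


level = strictest_rejection_level(-10.440, CRITICAL_VALUES)
if level is None:
    print("cannot reject a unit root")
else:
    print(f"unit root rejected at the {level} level")
```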

The estimated coefficient on the lagged level is $\beta \approx -1.159 < 0$, so $\rho = 1 + \beta \approx -0.159$. The test statistic Z(t) = -10.440 is below the 1% critical value of -3.527. Thus we may reject the null hypothesis, and there is no unit root in the regression model.
3.6 Logit Regression
A dummy dependent variable suits logit regression because the logit model describes the probability that the variable takes the value 1, here the probability that the Ravens win a game.
I then run the logit regression of the dependent variable on the original four independent variables.
Table 6. Results of Logit Regression Model

Variable	Coefficient	Standard Error	P-value
Passing	.0097464	.0040798**	0.017
Rushing	.0213477	.0061878***	0.001
Passlosing	-.004933	.0032167	0.125
Rushlosing	-.0011336	.0060437	0.851
Constant	-2.538214	1.428785	0.076
N	93
Pseudo R²	0.2158***
LR chi2(4)	25.24
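To make the logit coefficients concrete, they can be turned into a predicted win probability for a single game and set beside the OLS prediction from Table 2. The Python sketch below uses only the reported coefficients; the two example games are hypothetical stat lines chosen for illustration, not games from the data set.

```python
import math

# Coefficients from Table 6 (logit) and Table 2 (OLS).
LOGIT = {"const": -2.538214, "passing": 0.0097464, "rushing": 0.0213477,
         "passlosing": -0.004933, "rushlosing": -0.0011336}
OLS = {"const": 0.1216627, "passing": 0.0015903, "rushing": 0.003225,
       "passlosing": -0.0007787, "rushlosing": -0.0001939}


def linear_index(coefs, game):
    """Constant plus the sum of coefficient * yards for one game."""
    return coefs["const"] + sum(coefs[name] * yards for name, yards in game.items())


def logit_win_probability(game):
    """P(win) = 1 / (1 + exp(-index)); always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-linear_index(LOGIT, game)))


def ols_prediction(game):
    """Linear probability model prediction; not bounded to [0, 1]."""
    return linear_index(OLS, game)


# Hypothetical stat lines (yards), for illustration only.
typical_game = {"passing": 250, "rushing": 120, "passlosing": 220, "rushlosing": 100}
blowout_game = {"passing": 400, "rushing": 200, "passlosing": 150, "rushlosing": 50}

for label, game in (("typical", typical_game), ("blowout", blowout_game)):
    print(f"{label}: logit P(win) = {logit_win_probability(game):.3f}, "
          f"OLS prediction = {ols_prediction(game):.3f}")
```

For the extreme stat line the OLS prediction exceeds 1 while the logit probability stays below 1, which is the practical reason to prefer logit for a 0/1 outcome.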

As Table 6 shows, passing yards are still significant at the 95% level and rushing yards at the 99% level. As in the OLS model, passing yards allowed and rushing yards allowed are still insignificant.
The matching results in the logit model show that since 2008 the Ravens have relied mainly on their offense to win games, which is counterintuitive for the audiences and fans of the Ravens' All-Star defense.
3.7 Argument on Endogeneity
The test of endogeneity is an important test in statistical study. Endogeneity arises when the dependent variable may also cause variation and change in the independent variables.
However, this is highly unlikely in this regression model. I argue that the dependent variable, the win/loss record, is determined by the four independent variables, which reflect the performances of the athletes and the strategies of the coaching staff. The result of a game is not determined until its last second; that is to say, the dependent variable has no way to influence the performance of players and coaches. Thus we may reasonably argue that there is no endogeneity in the regression model.
4. Conclusion
Statistical methods are important in the analysis of politics, economics and other natural and social sciences. The use of statistics in sports science and sports analysis is not new. In this paper I collected a time-series data set and applied six statistical methods to analyze what exactly helped the Baltimore Ravens perform better than other teams over the past five seasons and finally win the Super Bowl. The results of the research are satisfactory. The team mainly relied on its offense to win games. The Ravens' weakness is their defense. The team should build up its defense in the future and maintain the level of performance of the past five seasons.

[1] Pro Football Reference. Baltimore Ravens Franchise Encyclopedia. http://www.pro-football-reference.com/teams/rav/

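Willc143 · posted 2013-5-18 13:37

The legendary technical post

apheodite · posted 2013-5-18 14:29

Completely lost...

国士无双 · posted 2013-5-18 14:55

This is a real paper, very technical

1048581555 · posted 2013-5-18 16:48

Bumping now, will read it slowly later

THXnewnew · posted 2013-5-18 19:30

Mind-blowing. I don't understand it, but it sounds impressive

Patriots_Fan · posted 2013-5-18 19:54

A humble nobody bows to the high-end tech geek!!

wasiwasi · posted 2013-5-18 19:56

Bumping first, will read later

xy_mango · posted 2013-5-18 20:25

As a poor student, I'm already proud I got through a dozen lines

S.R.911 · posted 2013-5-18 21:38

I bow down~~~~

丁小洁 · posted 2013-5-18 22:02

Epic post! I'm about to prostrate myself! Will study it carefully!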

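惊涛拍岸 · posted 2013-5-18 23:27

Willc143 posted 2013-5-18 13:37
The legendary technical post

You could say there's some technique in it, or you could say there isn't: memorize the commands, enter the data, and the result comes out in a second. The key is really just having an idea.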

惊涛拍岸 · posted 2013-5-18 23:27

apheodite posted 2013-5-18 14:29
Completely lost...

My bad, my bad~

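惊涛拍岸 · posted 2013-5-18 23:30

国士无双 posted 2013-5-18 14:55
This is a real paper, very technical

Call it a paper or call it homework, either works. It's just that after finishing it I was on edge, afraid that if the instructor couldn't follow it I'd be done for.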

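惊涛拍岸 · posted 2013-5-18 23:31

xy_mango posted 2013-5-18 20:25
As a poor student, I'm already proud I got through a dozen lines

Thanks for the support, Mango. A similar analysis could be done for the Saints since Brees arrived, but Brees really needs no data analysis to justify it; he simply deserves a big contract. Flacco, on the other hand, does need some explaining.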

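想飞的猫 · posted 2013-5-18 23:36

Nice idea

Did you check for multicollinearity? Variables like offensive and defensive yardage are quite likely to be correlated.

Also, going by your regression results, the fact that winning is unrelated to the defense actually shows how steady the Ravens defense is. The defense can always hold the opponent down, so the team wins as long as the offense delivers.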

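惊涛拍岸 · posted 2013-5-18 23:56

Last edited by 惊涛拍岸 on 2013-5-18 23:59

想飞的猫 posted 2013-5-18 23:36
Nice idea

Did you check for multicollinearity? Variables like offensive and defensive yardage are quite likely to be correlated.

Multicollinearity was the first thing I tested, and it has no effect. It isn't that the Ravens defense is steady: statistically insignificant means unrelated to the game result, that is, the randomness in the estimate is too large to draw a conclusion. That is its mathematical meaning. We cannot let reality shape the mathematical conclusion; we can only explain why the conclusion came out this way.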

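sharpzyg · posted 2013-5-19 01:19

Last edited by sharpzyg on 2013-5-19 01:20

OP, you clearly put a lot of effort in. The method is fine, but the conclusion is indeed a bit surprising.
Personally I think the problem may lie in how the dependent variable y is coded: 1 for a win and 0 for a loss puts the midpoint at 0.5, so the independent variables with positive coefficients carry more weight than those with negative coefficients. If a win were 1 and a loss -1, the positive and negative variables would have equal effect.
Even for a team that loses more than it wins, because y is never below 0 the positive-coefficient variables would still dominate, so the OP's model will always come out the same way: offense decides the result.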

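惊涛拍岸 · posted 2013-5-19 01:50

Last edited by 惊涛拍岸 on 2013-5-19 02:07

sharpzyg posted 2013-5-19 01:19
OP, you clearly put a lot of effort in. The method is fine, but the conclusion is indeed a bit surprising.
Personally I think the problem may lie in how the dependent variable y is coded: 1 for a win and 0 for a loss, ...

Statistical methods look quantitative but are really qualitative. The ordinary samples we work with are too small to compare coefficients, so we cannot weigh which independent variable matters more. With astronomy-scale statistics, millions or tens of millions of observations, you could.
Back to the first point: the statistics here are only qualitative. They test whether each independent variable is positively related, negatively related or not related at all to the dependent variable, and that has nothing to do with how the dependent variable is coded. This is regression, not simple descriptive statistics, so the values chosen for the coding do not matter. A dummy variable is the most effective coding scheme: it simply measures two mutually exclusive options with numbers.
Dummy variables are used everywhere in a wide range of other statistical models, and they take the values 1 and 0; this is established statistical convention with validated results. For example, I study politics, and in all political statistics whether a battle was won or lost is also measured with 1 and 0. You cannot say that because a battle matters more to a country than a sports game, a win should be 1 and a loss -1, because regression looks at the relationship between variables, not at the values you assign to the dependent variable. Likewise, in every US election, regressions of the effect of gender on the final vote also use a dummy variable: male 1 and female 0, or female 1 and male 0, it makes no difference. Nor can we say male is 1, female is -1 and transgender is 0. And in a two-party country, one party is 1 and the other 0; you cannot say that because neither value is negative, the party coded 1 comes out the winner every time. That would be neither realistic nor scientific. I hope these small examples make my point: there is no need to worry about the effect of 1 and 0 on the results. Dummy variables are widely used in more serious and more rigorous statistics, and their coding does not affect the results.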
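The coding question raised here is easy to check numerically. The Python sketch below fits a one-variable OLS regression twice, once with wins coded 1/0 and once with wins coded 1/-1, on a small made-up data set (the yardage and results are invented for illustration and are not the Ravens data). Since the second coding is just 2y - 1, the slope should double while the t-statistic, and with it the significance, stays the same.

```python
import math


def simple_ols(x, y):
    """One-regressor OLS: returns (slope, intercept, t-statistic of the slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / slope_se


# Hypothetical rushing yards and game results (1 = win, 0 = loss).
rushing = [60, 85, 95, 110, 120, 135, 150, 170]
win_01 = [0, 0, 1, 0, 1, 1, 1, 1]
win_pm1 = [2 * w - 1 for w in win_01]  # recode: win = 1, loss = -1

slope01, intercept01, t01 = simple_ols(rushing, win_01)
slope11, intercept11, t11 = simple_ols(rushing, win_pm1)

print(f"1/0 coding:  slope = {slope01:.5f}, t = {t01:.3f}")
print(f"1/-1 coding: slope = {slope11:.5f}, t = {t11:.3f}")
```

The same algebra carries over to several regressors: every coefficient and standard error is scaled by the same factor, so signs and significance levels do not depend on which of the two codings is used.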

Still, a good question. Discussion welcome~ haha~
