# Mathbb

a blog about statistics, probability and logic of science

## Friday, September 23, 2016

### ECML 2016 tutorial on Bayesian vs. Frequentist tests for comparing algorithms

### Clinton vs. Trump 23rd September 2016

*Python* code that computes the worst-case (red) and best-case (blue) posterior distributions for **Clinton** winning the USA general election, using *fresh* (September) poll data. At the moment the uncertainty is quite large, but the forecast is still in favour of Clinton: her probability of winning is **between 0.78 and 0.95**. If you are interested in the methodology I used to compute these distributions, please see the past posts. If you want to try it yourself, the Python code and the poll data from five**thirtyeight**.com are available on my GitHub (just click on the links).

## Friday, September 16, 2016

### 19 September Tutorial at ECML

**G. Corani, A. Benavoli, J. Demsar. Comparing competing algorithms: Bayesian versus frequentist hypothesis testing**

### Schedule

Time | Duration | Content | Details |
---|---|---|---|
09:00 | 15min | Introduction | Motivations and Goals |
09:15 | 60min | Null hypothesis significance tests in machine learning | NHST testing (methods and drawbacks) |
10:15 | 25min | Introduction to Bayesian tests | Bayesian model comparison versus Bayesian estimation |
10:40 | 20min | Break | Is the coffee in Riva del Garda better than the coffee in Porto? |
11:00 | 35min | Bayesian hypothesis testing for comparing classifiers | Single and hierarchical Bayesian models |
11:35 | 55min | Non-parametric Bayesian tests and presentation of the results of Bayesian analysis | Dirichlet process and how to perform nonparametric Bayesian tests |
12:30 | 10min | Summarizing! | Summary and conclusions |

## Tuesday, August 30, 2016

### General Poll for US Presidential Election 2016

We continue our adventure in the Bayesian USA 2016 election forecast through near-ignorance priors.

Today I will show how to compute the lower and upper probabilities of Clinton winning the 2016 general election. First, we load the lower and upper probabilities of Clinton winning in every single State (see http://idpstat.blogspot.ch/2016/08/bayesian-winning-lower-and-upper.html) as well as the Electoral Votes for each State:

```
import pandas as pd
lowerupper = pd.read_csv('LowerUpper.csv')
electoralvotes = pd.read_csv('electoralvotes.csv')
```

```
lowerupper
```

| | Unnamed: 0 | LowerProbability | UpperProbability |
---|---|---|---|
0 | 0 | 0.0026 | 0.0076 |
1 | 1 | 0.0155 | 0.0314 |
2 | 2 | 0.0367 | 0.0762 |
3 | 3 | 0.0690 | 0.1759 |
4 | 4 | 1.0000 | 1.0000 |
5 | 5 | 0.9034 | 0.9500 |
6 | 6 | 0.9937 | 0.9980 |
7 | 7 | 0.9950 | 0.9988 |
8 | 8 | 0.9850 | 0.9949 |
9 | 9 | 0.9459 | 0.9689 |
10 | 10 | 0.1747 | 0.2734 |
11 | 11 | 1.0000 | 1.0000 |
12 | 12 | 0.0013 | 0.0170 |
13 | 13 | 1.0000 | 1.0000 |
14 | 14 | 0.0003 | 0.0019 |
15 | 15 | 0.4979 | 0.6543 |
16 | 16 | 0.0098 | 0.0339 |
17 | 17 | 0.1462 | 0.2677 |
18 | 18 | 0.0081 | 0.0232 |
19 | 19 | 0.9931 | 0.9966 |
20 | 20 | 1.0000 | 1.0000 |
21 | 21 | 1.0000 | 1.0000 |
22 | 22 | 0.9549 | 0.9755 |
23 | 23 | 0.9774 | 0.9924 |
24 | 24 | 0.0006 | 0.0012 |
25 | 25 | 0.3789 | 0.5690 |
26 | 26 | 0.1162 | 0.3647 |
27 | 27 | 0.1013 | 0.2727 |
28 | 28 | 0.5729 | 0.7442 |
29 | 29 | 0.8637 | 0.9293 |
30 | 30 | 0.9999 | 1.0000 |
31 | 31 | 0.9890 | 0.9962 |
32 | 32 | 1.0000 | 1.0000 |
33 | 33 | 0.5132 | 0.6745 |
34 | 34 | 0.0003 | 0.0009 |
35 | 35 | 0.9250 | 0.9622 |
36 | 36 | 0.0000 | 0.0000 |
37 | 37 | 0.8899 | 0.9510 |
38 | 38 | 0.6900 | 0.8112 |
39 | 39 | 1.0000 | 1.0000 |
40 | 40 | 0.1803 | 0.3053 |
41 | 41 | 0.0448 | 0.0985 |
42 | 42 | 0.0079 | 0.0252 |
43 | 43 | 0.0000 | 0.0008 |
44 | 44 | 0.0001 | 0.0003 |
45 | 45 | 1.0000 | 1.0000 |
46 | 46 | 0.9994 | 0.9998 |
47 | 47 | 0.9996 | 1.0000 |
48 | 48 | 0.0712 | 0.2090 |
49 | 49 | 0.8702 | 0.9318 |
50 | 50 | 0.0000 | 0.0000 |

```
electoralvotes
```

| | Index | State | Vote |
---|---|---|---|
0 | 1 | Alabama | 9 |
1 | 2 | Alaska | 3 |
2 | 3 | Arizona | 11 |
3 | 4 | Arkansas | 6 |
4 | 5 | California | 55 |
5 | 6 | Colorado | 9 |
6 | 7 | Connecticut | 7 |
7 | 8 | Delaware | 3 |
8 | 9 | D.C. | 3 |
9 | 10 | Florida | 29 |
10 | 11 | Georgia | 16 |
11 | 12 | Hawaii | 4 |
12 | 13 | Idaho | 4 |
13 | 14 | Illinois | 20 |
14 | 15 | Indiana | 11 |
15 | 16 | Iowa | 6 |
16 | 17 | Kansas | 6 |
17 | 18 | Kentucky | 8 |
18 | 19 | Louisiana | 8 |
19 | 20 | Maine | 4 |
20 | 21 | Maryland | 10 |
21 | 22 | Massachusetts | 11 |
22 | 23 | Michigan | 16 |
23 | 24 | Minnesota | 10 |
24 | 25 | Mississippi | 6 |
25 | 26 | Missouri | 10 |
26 | 27 | Montana | 3 |
27 | 28 | Nebraska | 5 |
28 | 29 | Nevada | 6 |
29 | 30 | Hampshire | 4 |
30 | 31 | Jersey | 14 |
31 | 32 | Mexico | 5 |
32 | 33 | York | 29 |
33 | 34 | Carolina | 15 |
34 | 35 | Dakota | 3 |
35 | 36 | Ohio | 18 |
36 | 37 | Oklahoma | 7 |
37 | 38 | Oregon | 7 |
38 | 39 | Pennsylvania | 20 |
39 | 40 | Island | 4 |
40 | 41 | Carolina | 9 |
41 | 42 | Dakota | 3 |
42 | 43 | Tennessee | 11 |
43 | 44 | Texas | 38 |
44 | 45 | Utah | 6 |
45 | 46 | Vermont | 3 |
46 | 47 | Virginia | 13 |
47 | 48 | Washington | 12 |
48 | 49 | Virginia | 5 |
49 | 50 | Wisconsin | 10 |
50 | 51 | Wyoming | 3 |

We compute two histograms: one for the lower probability and one for the upper probability. To obtain the histogram for the lower probability, for each State we generate a random number $r$ in $[0,1]$ and assign the electoral votes of the State to Clinton if $r \leq LowerProbability$ for that State, or to Trump otherwise (and likewise for the upper probability). We also compute the lower and upper probabilities that the total number of electoral votes for Clinton exceeds the break-even line (which is equal to 269):

```
import numpy as np
#break-even line
evenline=269
#Monte Carlo samples
Np=10000
lowvotes=0
upvotes=0
LowElec=np.zeros(Np)
UpElec=np.zeros(Np)
for i in range(0,Np):
    lowElec=0
    upElec=0
    for s in range(0,51):
        #Clinton wins State s with the lower (resp. upper) probability
        if np.random.rand()<lowerupper['LowerProbability'][s]:
            lowElec=lowElec+electoralvotes['Vote'][s]
        if np.random.rand()<lowerupper['UpperProbability'][s]:
            upElec=upElec+electoralvotes['Vote'][s]
    LowElec[i]=lowElec
    UpElec[i]=upElec
    #count the samples in which Clinton exceeds the break-even line
    if lowElec>evenline:
        lowvotes=lowvotes+1
    if upElec>evenline:
        upvotes=upvotes+1
upvotes=upvotes/Np
lowvotes=lowvotes/Np
print('['+str(lowvotes) +',' +str(upvotes)+']')
```

[0.9981,0.9999]
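The same simulation can also be written without explicit loops. The following is only a minimal sketch: the function name `simulate_electoral_votes`, the fixed seed and the three-State toy data are my own assumptions for illustration and are not the real poll data.

```python
import numpy as np

def simulate_electoral_votes(probs, votes, n_samples=10000, seed=0):
    """Return one simulated electoral-vote total per Monte Carlo sample."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    votes = np.asarray(votes)
    # wins[i, s] is True when the candidate carries State s in sample i
    wins = rng.random((n_samples, probs.size)) < probs
    # sum the electoral votes of the States won in each sample
    return wins @ votes

# Hypothetical three-State example (NOT the real data)
probs = [1.0, 0.0, 0.5]
votes = [10, 20, 5]
totals = simulate_electoral_votes(probs, votes)
# fraction of samples in which the total exceeds a toy threshold of 12
frac = (totals > 12).mean()
print(np.unique(totals), frac)
```

With the real data one would pass `lowerupper['LowerProbability']` (or `UpperProbability`) and `electoralvotes['Vote']`, and compare the totals with the break-even line.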

```
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
sns.distplot(LowElec, axlabel="Electoral Votes (even-line in green)",
             kde=True, hist=True,color='darkred',label=str(lowvotes))
go=sns.distplot(UpElec,
                kde=True, hist=True,color='darkblue',label=str(upvotes))
go.set_title('Lower (red) and Upper (blue) distribution for Clinton')
go.legend()
plt.axvline(x=270.,color='g')
```


## Monday, August 29, 2016

### Bayesian winning lower and upper probabilities in all 51 States

As a first step, we load the data from an Excel file that includes 51 sheets (one per State) with the election poll data:

```
import pandas as pd
import os
xl_file = pd.ExcelFile('StatePoll.xlsx')
df = {sheet_name: xl_file.parse(sheet_name)
      for sheet_name in xl_file.sheet_names}
```

We then fuse the polls of each State with **Covariance Intersection**, discussed in the post http://idpstat.blogspot.ch/2016/08/combining-polls-data-from-different.html

```
def covariance_intersection(sample,A,B):
    #sample: sample sizes of the polls
    #A: poll fractions for the first candidate
    #B: poll fractions for the second candidate
    Af=0
    Bf=0
    Samplef=0
    for i in range(0,len(A)):
        Af=Af+A[i]*sample[i]/len(A)
        Bf=Bf+B[i]*sample[i]/len(A)
        Samplef=Samplef+sample[i]/len(A)
    #fused fractions are averages weighted by sample size
    Af=Af/Samplef
    Bf=Bf/Samplef
    return list((Samplef,Af,Bf))
```
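As written, the function returns the mean sample size together with the sample-size-weighted averages of the two candidates' fractions. Here is a small sketch of the same computation with NumPy; the name `fuse_polls` and the two polls are hypothetical and only serve as an illustration.

```python
import numpy as np

def fuse_polls(sample, a, b):
    """Mean sample size and sample-size-weighted average fractions."""
    sample = np.asarray(sample, dtype=float)
    fused_size = sample.mean()
    fused_a = np.average(a, weights=sample)
    fused_b = np.average(b, weights=sample)
    return fused_size, fused_a, fused_b

# Two hypothetical polls (NOT real data): sample sizes and candidate fractions
sample = [1000, 500]
clinton = [0.48, 0.42]
trump = [0.40, 0.45]
size, a, b = fuse_polls(sample, clinton, trump)
print(size, round(a, 4), round(b, 4))
```

The larger poll counts twice as much as the smaller one, so the fused fraction for the first candidate is (0.48*1000 + 0.42*500)/1500 = 0.46.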

```
#define the function of interest for Bayesian inference
def g(theta):
    #theta is a numpy array with one posterior sample per row
    return (theta[:,0]-theta[:,1])
#function that computes the posterior samples
def compute_posterior_samples(ap,Np):
    #ap: posterior Dirichlet distribution vector parameters
    #Np: number of MC samples
    return np.random.dirichlet(ap,Np) #we use numpy
import os
import warnings
warnings.filterwarnings('ignore')
from IPython.display import Image
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#make sure the output folder for the plots exists
os.makedirs('./plots',exist_ok=True)
#number of MC samples
Np=10000
#pseudo-votes for Near-Ignorance Priors
c=2
Statesind=0
for key, value in sorted(df.items()):
    Statesind=Statesind+1
    #compute the list of fused polls
    fus=covariance_intersection(df[key]['SAMPLE'].values,df[key]['Clinton'].values,df[key]['Trump'].values)
    #data from the poll
    datapoll=np.array([fus[0]*fus[1],fus[0]*fus[2],fus[0]*(1-fus[1]-fus[2])])
    #prior for a swing scenario in favor of Clinton
    au=np.array([c,-c,0])
    #prior for a swing scenario in favor of Trump
    al=np.array([-c,c,0])
    #compute the lower and upper distributions for the two swing scenarios
    postsampleslower = compute_posterior_samples(datapoll+al,Np)
    postsamplesupper = compute_posterior_samples(datapoll+au,Np)
    #compute the lower and upper probabilities
    problower=sum(g(postsampleslower)>0)/Np
    probupper=sum(g(postsamplesupper)>0)/Np
    #plot the figs and save to temp files
    sns.distplot(g(postsampleslower), axlabel="Clinton-Trump",
                 kde=True, hist=True,color='darkred') #, hist_kws={"range": [-1,1]}
    go=sns.distplot(g(postsamplesupper),
                    kde=True, hist=True,color='darkblue') #, hist_kws={"range": [-1,1]}
    go.set_title(key+' ['+str(problower)+','+str(probupper)+']')
    plt.axvline(x=0.,color='g')
    namefile='./plots/f'+str(Statesind)+'.png'
    plt.savefig(namefile);
    plt.close();
```
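To see the core computation without the plotting, here is a minimal sketch for a single State. The function name `winning_probability_bounds`, the fixed seed and the fused poll counts are assumptions made up for this example; the two swing priors are the same `[-c,c,0]` and `[c,-c,0]` used above.

```python
import numpy as np

def winning_probability_bounds(counts, c=2, n_samples=10000, seed=0):
    """Lower and upper posterior probability that candidate 0 beats candidate 1."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    # c pseudo-votes moved from one candidate to the other
    shift = np.array([c, -c, 0.0])
    bounds = []
    for prior in (-shift, shift):  # swing against candidate 0 first, then in favour
        theta = rng.dirichlet(counts + prior, n_samples)
        # fraction of posterior samples in which candidate 0 is ahead
        bounds.append(np.mean(theta[:, 0] > theta[:, 1]))
    return tuple(bounds)

# Hypothetical fused poll (NOT real data): 750 respondents,
# 46% Clinton, 42% Trump, 12% other or undecided
counts = [345.0, 315.0, 90.0]
lower, upper = winning_probability_bounds(counts)
print(lower, upper)
```

Because only `c=2` pseudo-votes are moved, the gap between the two probabilities shrinks as the fused sample size grows.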

```
# Combine them with imshow
fig, ax = plt.subplots(11,5, figsize=(30,50))
count=0
for i1 in range(0,11):
    for i2 in range(0,5):
        count=count+1
        if count>51:
            ax[i1,i2].set_visible(False)
        else:
            ax[i1,i2].imshow(plt.imread('./plots/f%s.png' %count), aspect='auto')
            ax[i1,i2].axis('off')
plt.tight_layout()
plt.savefig('AllStates')
plt.show()
```

Looking at the plots, we can distinguish three types of States:

- States where both the lower and upper distributions are (almost entirely) to the right of the green line (these are the States that are clearly for Clinton);
- States where both the lower and upper distributions are (almost entirely) to the left of the green line (these are the States that are clearly for Trump);
- States where the lower and upper distributions lie across the line (these are the undecided States).

The other **undecided States** are

- Iowa [0.50,0.64]
- Missouri [0.38, 0.56]
- Nevada [0.58, 0.74]
- North-Carolina [0.52,0.66]
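One possible way to turn this visual classification into a rule is to compare the lower and upper probabilities with a cut-off. This is only a sketch: the function `classify_state` and the cut-off `eps=0.05` are my own assumptions, not part of the analysis above, and the last two entries are hypothetical States added for contrast.

```python
def classify_state(lower, upper, eps=0.05):
    """Label a State from its lower/upper winning probabilities for Clinton."""
    if lower >= 1 - eps:
        return 'Clinton'    # even the worst case is clearly for Clinton
    if upper <= eps:
        return 'Trump'      # even the best case is clearly for Trump
    return 'undecided'

intervals = {
    'Iowa': (0.50, 0.64),
    'Missouri': (0.38, 0.56),
    'Nevada': (0.58, 0.74),
    'North-Carolina': (0.52, 0.66),
    'Hypothetical A': (0.99, 1.00),   # made-up clear Clinton State
    'Hypothetical B': (0.00, 0.01),   # made-up clear Trump State
}
labels = {state: classify_state(lo, up) for state, (lo, up) in intervals.items()}
for state, label in labels.items():
    print(state, label)
```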
