Friday, September 23, 2016

ECML 2016 tutorial on Bayesian vs. Frequentist tests for comparing algorithms

The tutorial went very well. It was a great experience and we received very positive feedback. If you are interested in the content, please visit this page.

Clinton vs. Trump, 23rd September 2016


I have run again the Python code that computes the worst-case (red) and best-case (blue) posterior distributions for Clinton winning the US general election, using fresh (September) poll data. At the moment there is quite a large uncertainty, but the result is still in favour of Clinton: her probability of winning is between 0.78 and 0.95. If you are interested in the methodology used to compute these distributions, please see the past posts. If you want to try it yourself, the Python code and the poll data from fivethirtyeight.com are available in my github (just click on the links).
 

Friday, September 16, 2016

19 September Tutorial at ECML

Working on the slides for our Tutorial at ECML 2016 (Riva del Garda)  


G. Corani, A. Benavoli, J. Demsar.  Comparing competing algorithms: Bayesian versus frequentist hypothesis testing


Schedule

Time  | Duration | Content | Details
09:00 | 15 min | Introduction | Motivations and goals
09:15 | 60 min | Null hypothesis significance tests in machine learning | NHST (methods and drawbacks)
10:15 | 25 min | Introduction to Bayesian tests | Bayesian model comparison versus Bayesian estimation
10:40 | 20 min | Break | Is the coffee in Riva del Garda better than the coffee in Porto?
11:00 | 35 min | Bayesian hypothesis testing for comparing classifiers | Single and hierarchical Bayesian models
11:35 | 55 min | Non-parametric Bayesian tests and presentation of the results of Bayesian analysis | Dirichlet process and how to perform nonparametric Bayesian tests
12:30 | 10 min | Summarizing! | Summary and conclusions

Tuesday, August 30, 2016

General Poll for US Presidential Election 2016

We continue our adventure in the Bayesian USA 2016 election forecast through near-ignorance priors.

Today I will show how to compute the lower and upper probabilities of Clinton winning the general election 2016. First, we load the lower and upper probabilities of Clinton winning each single State (see http://idpstat.blogspot.ch/2016/08/bayesian-winning-lower-and-upper.html) as well as the Electoral Votes for each state.

In [12]:
import pandas as pd
lowerupper     = pd.read_csv('LowerUpper.csv')
electoralvotes = pd.read_csv('electoralvotes.csv')
In [40]:
lowerupper
Out[40]:
Unnamed: 0 LowerProbability UpperProbability
0 0 0.0026 0.0076
1 1 0.0155 0.0314
2 2 0.0367 0.0762
3 3 0.0690 0.1759
4 4 1.0000 1.0000
5 5 0.9034 0.9500
6 6 0.9937 0.9980
7 7 0.9950 0.9988
8 8 0.9850 0.9949
9 9 0.9459 0.9689
10 10 0.1747 0.2734
11 11 1.0000 1.0000
12 12 0.0013 0.0170
13 13 1.0000 1.0000
14 14 0.0003 0.0019
15 15 0.4979 0.6543
16 16 0.0098 0.0339
17 17 0.1462 0.2677
18 18 0.0081 0.0232
19 19 0.9931 0.9966
20 20 1.0000 1.0000
21 21 1.0000 1.0000
22 22 0.9549 0.9755
23 23 0.9774 0.9924
24 24 0.0006 0.0012
25 25 0.3789 0.5690
26 26 0.1162 0.3647
27 27 0.1013 0.2727
28 28 0.5729 0.7442
29 29 0.8637 0.9293
30 30 0.9999 1.0000
31 31 0.9890 0.9962
32 32 1.0000 1.0000
33 33 0.5132 0.6745
34 34 0.0003 0.0009
35 35 0.9250 0.9622
36 36 0.0000 0.0000
37 37 0.8899 0.9510
38 38 0.6900 0.8112
39 39 1.0000 1.0000
40 40 0.1803 0.3053
41 41 0.0448 0.0985
42 42 0.0079 0.0252
43 43 0.0000 0.0008
44 44 0.0001 0.0003
45 45 1.0000 1.0000
46 46 0.9994 0.9998
47 47 0.9996 1.0000
48 48 0.0712 0.2090
49 49 0.8702 0.9318
50 50 0.0000 0.0000
In [41]:
electoralvotes
Out[41]:
Index State Vote
0 1 Alabama 9
1 2 Alaska 3
2 3 Arizona 11
3 4 Arkansas 6
4 5 California 55
5 6 Colorado 9
6 7 Connecticut 7
7 8 Delaware 3
8 9 D.C. 3
9 10 Florida 29
10 11 Georgia 16
11 12 Hawaii 4
12 13 Idaho 4
13 14 Illinois 20
14 15 Indiana 11
15 16 Iowa 6
16 17 Kansas 6
17 18 Kentucky 8
18 19 Louisiana 8
19 20 Maine 4
20 21 Maryland 10
21 22 Massachusetts 11
22 23 Michigan 16
23 24 Minnesota 10
24 25 Mississippi 6
25 26 Missouri 10
26 27 Montana 3
27 28 Nebraska 5
28 29 Nevada 6
29 30 New Hampshire 4
30 31 New Jersey 14
31 32 New Mexico 5
32 33 New York 29
33 34 North Carolina 15
34 35 North Dakota 3
35 36 Ohio 18
36 37 Oklahoma 7
37 38 Oregon 7
38 39 Pennsylvania 20
39 40 Rhode Island 4
40 41 South Carolina 9
41 42 South Dakota 3
42 43 Tennessee 11
43 44 Texas 38
44 45 Utah 6
45 46 Vermont 3
46 47 Virginia 13
47 48 Washington 12
48 49 West Virginia 5
49 50 Wisconsin 10
50 51 Wyoming 3

We compute two histograms: one for the lower probabilities and one for the upper probabilities. To obtain the lower histogram: for each State, we generate a random number r in [0,1] and assign the Electoral Votes of the State to Clinton if $r \leq LowerProbability$ for that State, or to Trump otherwise (and similarly for the upper histogram). We also compute the lower and upper probability that the total Electoral Votes for Clinton exceed the break-even line (equal to 269).

In [34]:
import numpy as np

# break-even line: more than 269 Electoral Votes means winning
evenline = 269
# number of Monte Carlo samples
Np = 10000

lowvotes = 0
upvotes = 0
LowElec = np.zeros(Np)
UpElec = np.zeros(Np)
for i in range(0, Np):
    lowElec = 0
    upElec = 0
    for s in range(0, 51):
        # assign the State's Electoral Votes to Clinton with its lower/upper probability
        if np.random.rand(1) < lowerupper['LowerProbability'][s]:
            lowElec = lowElec + electoralvotes['Vote'][s]
        if np.random.rand(1) < lowerupper['UpperProbability'][s]:
            upElec = upElec + electoralvotes['Vote'][s]
    LowElec[i] = lowElec
    UpElec[i] = upElec
    if lowElec > evenline:
        lowvotes = lowvotes + 1
    if upElec > evenline:
        upvotes = upvotes + 1

upvotes = upvotes / Np
lowvotes = lowvotes / Np
print('[' + str(lowvotes) + ',' + str(upvotes) + ']')
[0.9981,0.9999]
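The double loop above can also be vectorized with NumPy, which makes the simulation essentially instantaneous. The sketch below uses a hypothetical three-state example (the probabilities and vote counts are made up, not taken from the poll data): one uniform draw per (sample, state) decides whether the state's Electoral Votes go to Clinton.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_totals(probs, votes, n_samples=10000):
    # probs: per-state probability that Clinton wins the state
    # votes: per-state Electoral Votes
    # one uniform draw per (sample, state); True means Clinton takes the state
    wins = rng.random((n_samples, len(probs))) < probs
    return wins.astype(int) @ votes  # total Electoral Votes per sample

# hypothetical three-state example
probs = np.array([0.9, 0.5, 0.1])
votes = np.array([10, 20, 5])
totals = simulate_totals(probs, votes)
prob_win = np.mean(totals > 15)  # probability of exceeding a toy break-even line
```

Running this once with the lower probabilities and once with the upper probabilities gives the same [lowvotes, upvotes] interval as the loop above, up to Monte Carlo noise.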
In [44]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt

sns.distplot(LowElec, axlabel="Electoral Votes (even-line in green)",
             kde=True, hist=True, color='darkred', label=str(lowvotes))
go = sns.distplot(UpElec,
             kde=True, hist=True, color='darkblue', label=str(upvotes))
go.set_title('Lower (red) and Upper (blue) distribution for Clinton')
go.legend()
plt.axvline(x=270., color='g')
Out[44]:
<matplotlib.lines.Line2D at 0x7f1513db14a8>

Monday, August 29, 2016

Bayesian winning lower and upper probabilities in all 51 States

This post is about how to perform a Bayesian analysis of election polls for the USA 2016 presidential election. In previous posts, I discussed how to analyse a poll for a single State (Nevada as an example). Here we will use some simple Python functions to compute the probability of Clinton winning in all 51 States (State-by-State). I will make use of the near-ignorance prior models that were introduced in a past post.
As a first step, we load the data from an Excel file that includes 51 sheets (one per state) with the election poll data.
In [59]:
import pandas as pd
import os
xl_file = pd.ExcelFile('StatePoll.xlsx')
df = {sheet_name: xl_file.parse(sheet_name) 
          for sheet_name in xl_file.sheet_names}
The following function implements the Covariance Intersection discussed in the post http://idpstat.blogspot.ch/2016/08/combining-polls-data-from-different.html
In [2]:
def covariance_intersection(sample, A, B):
    # sample: list of poll sample sizes
    # A: poll shares for the first candidate
    # B: poll shares for the second candidate
    Af = 0
    Bf = 0
    Samplef = 0
    for i in range(0, len(A)):
        Af = Af + A[i]*sample[i]/len(A)
        Bf = Bf + B[i]*sample[i]/len(A)
        Samplef = Samplef + sample[i]/len(A)
    # normalise: the fused shares are sample-size-weighted averages
    Af = Af/Samplef
    Bf = Bf/Samplef
    return list((Samplef, Af, Bf))
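As a quick sanity check, here is a toy call with two hypothetical polls (500 and 1000 respondents; all shares are made up). The snippet restates the fusion rule in compact but equivalent form so it runs on its own: the fused shares are sample-size-weighted averages, and the fused sample size is the mean of the poll sizes.

```python
def covariance_intersection(sample, A, B):
    # equivalent compact form: shares weighted by sample size,
    # fused sample size equal to the mean of the poll sizes
    Af = sum(a * n for a, n in zip(A, sample))
    Bf = sum(b * n for b, n in zip(B, sample))
    Samplef = sum(sample) / len(sample)
    return [Samplef, Af / sum(sample), Bf / sum(sample)]

# two hypothetical polls: 500 and 1000 respondents
fused = covariance_intersection([500, 1000], [0.45, 0.48], [0.43, 0.44])
# fused ≈ [750.0, 0.47, 0.4367]
```

The larger poll pulls the fused Clinton share toward 0.48, as expected from the weighting.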
We implement the Bayesian election forecast using the multinomial-Dirichlet model with near-ignorance priors discussed in http://idpstat.blogspot.ch/2016/08/blog-post_9.html for Nevada. Here the procedure is applied to all States. The following loop computes the poll state-by-state and saves the resulting posterior lower and upper distributions to png files.
In [67]:
#define the function of interest for Bayesian inference
def g(theta):
    #theta is a numpy array of posterior samples
    return (theta[:,0]-theta[:,1])

#function that computes the posterior samples
def compute_posterior_samples(ap,Np):
    #ap: posterior Dirichlet distribution parameter vector
    #Np: number of MC samples
    return np.random.dirichlet(ap,Np) #we use numpy

import warnings
warnings.filterwarnings('ignore')
from IPython.display import Image
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#number of MC samples
Np=10000
#pseudo-votes for Near-Ignorance Priors
c=2

Statesind=0
for key, value in sorted(df.items()):
    Statesind=Statesind+1
    #compute the list of fused polls
    fus=covariance_intersection(df[key]['SAMPLE'].values,df[key]['Clinton'].values,df[key]['Trump'].values)
    #data from the poll
    datapoll=np.array([fus[0]*fus[1],fus[0]*fus[2],fus[0]*(1-fus[1]-fus[2])])
    #prior for a swing scenario in favor of Clinton
    au=np.array([c,-c,0])
    #prior for a swing scenario in favor of Trump
    al=np.array([-c,c,0])

    #compute the lower and upper distributions for the two swing scenarios
    postsampleslower = compute_posterior_samples(datapoll+al,Np)
    postsamplesupper = compute_posterior_samples(datapoll+au,Np)
    #compute the lower and upper probabilities of winning
    problower=sum(g(postsampleslower)>0)/Np
    probupper=sum(g(postsamplesupper)>0)/Np
    
    
    # Plot the figs and save to temp files
    sns.distplot(g(postsampleslower), axlabel="Clinton-Trump", 
                 kde=True, hist=True,color='darkred') #, hist_kws={"range": [-1,1]}
    go=sns.distplot(g(postsamplesupper), 
                 kde=True, hist=True,color='darkblue') #, hist_kws={"range": [-1,1
    
    go.set_title(key+'    ['+str(problower)+','+str(probupper)+']')
    plt.axvline(x=0.,color='g')
    namefile='./plots/f'+str(Statesind)+'.png'
    plt.savefig(namefile); 
    plt.close();
    
   
In [68]:
# Combine them with imshows
fig, ax = plt.subplots(11,5, figsize=(30,50))
count=0
for i1 in range(0,11):
    for i2 in range(0,5):
        count=count+1
        if count>51:
            ax[i1,i2].set_visible(False)
        else:
            ax[i1,i2].imshow(plt.imread('./plots/f%s.png' %count), aspect='auto'); ax[i1,i2].axis('off')
            plt.tight_layout()
            
plt.savefig('AllStates')
plt.show()
The green line represents the "even-line". We therefore have three different situations:
  1. both lower and upper distributions are (almost all) to the right of the green line (these are the States that are clearly for Clinton);
  2. both lower and upper distributions are (almost all) to the left of the green line (these are the States that are clearly for Trump);
  3. States where the lower and upper distributions are across the line.
For instance, States that are clearly for Clinton are California, Connecticut, Delaware, etc., while States that are clearly for Trump are Alabama, Alaska, Arizona, etc. The values of the lower and upper probability of Clinton winning the State are reported at the top of the plots. One of the advantages of near-ignorance prior models is that they allow us to automatically detect the States that are more uncertain, i.e., States where a small change in vote intention can dramatically change the final result. Consider for instance the undecided State North Carolina (it is undecided because the lower and upper probabilities are 0.52 and 0.66). A change of the vote of only 2 people from Clinton to Trump (that is, 0.4% of the poll sample size) reduces Clinton's probability of winning to 0.52.
The other undecided States are
  • Iowa [0.50,0.64]
  • Missouri [0.38, 0.56]
  • Nevada [0.58, 0.74]
  • North-Carolina [0.52,0.66]
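The sensitivity induced by the c = 2 pseudo-votes can be reproduced with a small sketch. The poll counts below are hypothetical (a near-tied poll, not the North Carolina data), but the mechanism is the one used above: the lower and upper winning probabilities come from the two Dirichlet posteriors obtained by swinging c pseudo-votes toward Trump and toward Clinton, respectively.

```python
import numpy as np

rng = np.random.default_rng(1)
Np = 100000  # Monte Carlo samples
c = 2        # pseudo-votes of the near-ignorance prior

# hypothetical near-tied poll counts: Clinton, Trump, other
datapoll = np.array([255.0, 245.0, 20.0])

def win_prob(alpha, n=Np):
    # P(Clinton share > Trump share) under a Dirichlet(alpha) posterior
    theta = rng.dirichlet(alpha, n)
    return np.mean(theta[:, 0] > theta[:, 1])

lower = win_prob(datapoll + np.array([-c, c, 0]))  # swing toward Trump
upper = win_prob(datapoll + np.array([c, -c, 0]))  # swing toward Clinton
```

For a clear-cut State the two posteriors nearly coincide; for a near-tied poll like this one the interval [lower, upper] is wide, which is exactly how the undecided States are flagged.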
We are almost ready to compute the general election result; that will be the topic of the next post.