Python for statistical analysis

From Notepedia


Distribution: the set of all possible values of a random variable, together with how likely each value is to occur

Binomial Distributions

eg: Coin Flip
  • Binomial = 2 possible outcomes
  • Discrete = outcomes are categories, not real-valued numbers
  • Evenly weighted = equal chance of either side
import numpy as np
# n = number of trials, p = probability of success (a result of 1)
np.random.binomial(n, p, size=sample_size)

# Coin eg: one flip of a fair coin, returns 0 or 1
np.random.binomial(1, 0.5)
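The single flip above scales up with the size argument; a minimal sketch (the seed and the count of 1000 flips are arbitrary choices for illustration):

```python
import numpy as np

np.random.seed(0)  # seed only so the sketch is reproducible
# 1000 independent flips of a fair coin, each returning 0 or 1
flips = np.random.binomial(1, 0.5, size=1000)
heads = flips.sum()   # count of 1s ("heads")
print(heads / 1000)   # proportion of heads, close to 0.5
```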
Uniform distribution

The result is a continuous value rather than a category. To graph the distribution:

x = Value of observation

y = Probability an observation will occur
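The notes give no code for this section; a minimal sketch of sampling a uniform distribution with NumPy (the bounds 0 and 1 are NumPy's defaults):

```python
import numpy as np

# Draw 1000 values uniformly from the interval [0, 1)
values = np.random.uniform(0.0, 1.0, size=1000)
# Every value in the interval is equally likely, so the sample mean
# should sit near the midpoint of the interval, 0.5
print(values.mean())
```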

Normal (Gaussian) distribution

Bell curve, mean is the center value.

Expected value = The mean if the experiment were repeated infinitely many times

Mean value = The average of the sample actually taken

Variance = Measure of how broadly values are spread around the mean
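These definitions can be checked numerically; a sketch using a normal sample (the seed and parameters are illustrative only):

```python
import numpy as np

np.random.seed(42)
sample = np.random.normal(0.75, size=1000)  # expected value (loc) is 0.75

mean = sample.mean()                      # sample mean, close to 0.75
variance = ((sample - mean) ** 2).mean()  # average squared distance from the mean
print(mean, variance)                     # variance matches np.var(sample)
```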

Central tendency

  • Mode
  • Median
  • Mean
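The three measures above can each be computed directly (a sketch with made-up data; Counter is used for the mode to avoid depending on a particular SciPy version):

```python
import numpy as np
from collections import Counter

data = np.array([1, 2, 2, 3, 4, 7, 9])  # hypothetical sample

mode = Counter(data.tolist()).most_common(1)[0][0]  # most frequent value -> 2
median = np.median(data)                            # middle value -> 3.0
mean = data.mean()                                  # arithmetic average -> 4.0
print(mode, median, mean)
```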

Standard deviation

A measure of how far values typically fall from the mean (the square root of the variance)
distribution = np.random.normal(0.75, size=1000)

# Manually calculating std dev
np.sqrt(np.sum((np.mean(distribution) - distribution)**2) / len(distribution))

# Numpy std dev calculation
np.std(distribution)

Kurtosis

Shape of the tails of the distribution
# Negative = flatter than a normal distribution
# Positive = more peaked than a normal distribution
import scipy.stats as stats
stats.kurtosis(distribution)

Chi Squared Distribution

Chi squared - degrees of freedom

Right-skewed distribution (the long tail extends to the right)

As degrees of freedom increase, the curve becomes more normal
# More skewed
chi_squared_df2 = np.random.chisquare(2, size=10000)

# Less skewed
chi_squared_df5 = np.random.chisquare(5, size=10000)

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

output = plt.hist([chi_squared_df2,chi_squared_df5], bins=50, histtype='step', 
                  label=['2 degrees of freedom','5 degrees of freedom'])
plt.legend(loc='upper right')
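The change in skew visible in the histogram can also be measured with scipy.stats.skew (a sketch; the seed is arbitrary, and both samples have positive skew that shrinks with degrees of freedom):

```python
import numpy as np
import scipy.stats as stats

np.random.seed(0)
chi_squared_df2 = np.random.chisquare(2, size=10000)
chi_squared_df5 = np.random.chisquare(5, size=10000)

# Skew is positive for both, and shrinks as degrees of freedom grow
print(stats.skew(chi_squared_df2))  # larger positive skew
print(stats.skew(chi_squared_df5))  # smaller positive skew
```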

Hypothesis testing

A hypothesis is a statement we can test. A common example is A/B testing, which is very popular on the web.

Alternative hypothesis: Our hypothesis

Null hypothesis: The opposite of our idea, there is no difference between the groups

Critical value alpha : Significance level

  • How much chance of a false positive you're willing to accept
  • Typical values: 0.1, 0.05, or 0.01 in the social sciences; as small as 10^-5 in physics
The SciPy library provides t-tests to compare samples for significance:
from scipy import stats
stats.ttest_ind(early['ass_results'], late['ass_results'])

>>> Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
Note: the above pvalue is > 0.05 (alpha), so we cannot reject the null hypothesis. In other words, there is no statistically significant difference between the two samples.
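The same test can be run end to end on synthetic data (a sketch; the group names and parameters are made up, not the early/late assignment data above):

```python
import numpy as np
from scipy import stats

np.random.seed(1)
# Two hypothetical groups drawn from the *same* distribution,
# so any difference between them is due to chance
group_a = np.random.normal(70, 10, size=100)
group_b = np.random.normal(70, 10, size=100)

statistic, pvalue = stats.ttest_ind(group_a, group_b)
print(statistic, pvalue)
if pvalue > 0.05:
    print("Cannot reject the null hypothesis")
```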

P-hacking/Dredging: The more t-tests performed, the more likely you are to eventually get a significant result by chance alone. Workarounds:

  • Bonferroni correction - Tighten the alpha value based on the number of tests performed
    • Eg: If 3 tests are made at a=0.05, then each test must meet 0.05/3 ≈ 0.017 to count as significant (conservative)
  • Hold out tests
  • Investigation pre-registration
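The Bonferroni correction in the first bullet is simple enough to sketch directly (the p-values below are hypothetical):

```python
# Bonferroni correction: divide alpha by the number of tests run
alpha = 0.05
num_tests = 3
corrected_alpha = alpha / num_tests  # 0.05 / 3, about 0.017

# A result now counts as significant only below the tighter threshold
pvalues = [0.020, 0.010, 0.049]  # hypothetical p-values from 3 tests
significant = [p < corrected_alpha for p in pvalues]
print(significant)  # only the 0.010 result survives the correction
```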