Python for statistical analysis


Distributions

Distribution: describes all possible values of a random variable and how likely each one is

Binomial Distributions

eg: Coin flip
  • Binomial = 2 possible outcomes
  • Discrete = outcomes are categories, not real numbers
  • Evenly weighted = equal chance of either side
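import numpy as np

# General form: n trials per experiment, each with probability p of success (a 1)
# np.random.binomial(n, p, size=sample_size)

# Coin eg: one flip of a fair coin, returns 0 or 1
np.random.binomial(1, 0.5)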
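
To see how the coin flip behaves over many trials, here is a minimal sketch. The seed, the sample sizes and the variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability

# 10,000 experiments of a single fair coin flip each: every result is 0 or 1.
flips = np.random.binomial(1, 0.5, size=10_000)
heads_fraction = flips.mean()  # should be close to 0.5

# 10,000 experiments of 20 flips each: every result is a head count in 0..20.
head_counts = np.random.binomial(20, 0.5, size=10_000)
average_heads = head_counts.mean()  # should be close to 20 * 0.5 = 10

print(heads_fraction, average_heads)
```

The first argument is the number of flips per experiment, while size is the number of experiments, which is an easy pair to mix up.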

Uniform distribution

The result is a value, not a category, and every value in the range is equally likely. These distributions are intended to be graphed:

x = Value of observation

y = Probability an observation will occur
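
As a rough illustration of the flat shape, the sketch below draws uniform samples and checks that each bin of a histogram has about the same height. The range [0, 1), the bin count and the seed are assumptions chosen for the example.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability

# 100,000 draws from a uniform distribution on [0, 1).
samples = np.random.uniform(0, 1, size=100_000)

# Split the range into 10 equal bins; density=True scales the bar heights
# so that the histogram integrates to 1, like a probability density.
density, bin_edges = np.histogram(samples, bins=10, range=(0, 1), density=True)

# Every bin should have roughly the same height (about 1.0 here).
print(density.round(2))
```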

Normal (Gaussian) Distribution

Bell curve, mean is the center value.

Expected value = The mean you would get if the experiment were repeated infinitely many times

Mean value = The average of the values in the sample actually taken

Variance = Measure of how widely values are spread around the mean
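
The difference between the expected value and a sample mean can be sketched with a fair six-sided die. The die, the sample sizes and the seed are assumptions picked for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability

# A fair six-sided die: the expected value is (1 + 2 + ... + 6) / 6 = 3.5.
expected_value = np.arange(1, 7).mean()

# randint's upper bound is exclusive, so this draws integers 1..6.
small_sample = np.random.randint(1, 7, size=10)
large_sample = np.random.randint(1, 7, size=100_000)

print(small_sample.mean())  # can be noticeably off from 3.5
print(large_sample.mean())  # close to 3.5

# Variance: the average squared distance from the mean.
manual_variance = np.mean((large_sample - large_sample.mean()) ** 2)
print(manual_variance, np.var(large_sample))  # both close to 35/12
```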

Central tendency

  • Mode
  • Median
  • Mean
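
A small sketch of the three measures on a hand-picked list. The values are made up for the example, and the standard library's statistics.mode is used here as one convenient option.

```python
import statistics
import numpy as np

values = [1, 2, 2, 3, 4, 7, 9]

mode = statistics.mode(values)  # most frequent value -> 2
median = np.median(values)      # middle value once sorted -> 3.0
mean = np.mean(values)          # sum / count = 28 / 7 -> 4.0

print(mode, median, mean)
```

The mean is pulled towards the large values 7 and 9, while the median and mode are not.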

Standard deviation

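How far values typically are from the mean (the square root of the variance)
# 1000 samples with mean 0.75 and the default standard deviation of 1
distribution = np.random.normal(0.75, size=1000)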

# Manually calculating std dev
np.sqrt(np.sum((np.mean(distribution)-distribution)**2)/len(distribution))

# Numpy std dev calculation (divides by n by default, like the manual version)
np.std(distribution)
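
Both calculations above divide by n, which gives the population standard deviation. When the data is a sample from a larger population, dividing by n - 1 is the usual alternative, and np.std supports it through its ddof parameter. A minimal sketch, with an arbitrary seed:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability
distribution = np.random.normal(0.75, size=1000)

population_std = np.std(distribution)      # divides by n (ddof=0, the default)
sample_std = np.std(distribution, ddof=1)  # divides by n - 1

# The same n - 1 calculation done by hand.
squared_diffs = (distribution - np.mean(distribution)) ** 2
manual_sample_std = np.sqrt(np.sum(squared_diffs) / (len(distribution) - 1))

print(population_std, sample_std, manual_sample_std)
```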

Kurtosis

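Shape of the tails of the distribution, relative to a normal distribution
# Negative = lighter tails, flatter than a normal distribution
# Positive = heavier tails, more peaked than a normal distribution
# stats.kurtosis returns excess kurtosis by default, so a normal distribution scores about 0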
 
import scipy.stats as stats
stats.kurtosis(distribution)
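
To put numbers on the sign convention, the sketch below compares three distributions. The choice of uniform and Laplace samples, the sample size and the seed are assumptions made for illustration.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(0)  # arbitrary seed, only for repeatability
n = 100_000

normal_sample = np.random.normal(size=n)           # theoretical excess kurtosis 0
uniform_sample = np.random.uniform(-1, 1, size=n)  # flat, light tails: -1.2
laplace_sample = np.random.laplace(size=n)         # peaked, heavy tails: +3

k_normal = stats.kurtosis(normal_sample)
k_uniform = stats.kurtosis(uniform_sample)
k_laplace = stats.kurtosis(laplace_sample)

print(k_normal, k_uniform, k_laplace)
```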

Chi Squared Distribution

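The chi-squared distribution has one parameter: the degrees of freedom

Right-skewed (positively skewed) distribution: most values sit on the left, with a long tail to the right

As the degrees of freedom increase, the curve becomes more symmetric and closer to normal
# More right skewed (stats.skew returns a larger positive value)
chi_squared_df2 = np.random.chisquare(2, size=10000)
stats.skew(chi_squared_df2)

# Less right skewed (smaller positive value)
chi_squared_df5 = np.random.chisquare(5, size=10000)
stats.skew(chi_squared_df5)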

# Jupyter/IPython magic to show plots inline (omit outside a notebook)
%matplotlib inline
import matplotlib.pyplot as plt

output = plt.hist([chi_squared_df2,chi_squared_df5], bins=50, histtype='step', 
                  label=['2 degrees of freedom','5 degrees of freedom'])
plt.legend(loc='upper right')
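
The skewness of a chi-squared distribution with k degrees of freedom is sqrt(8/k), so it is positive and shrinks towards 0 as k grows. The sketch below compares that formula with stats.skew on simulated samples. The choice of k values, the sample size and the seed are assumptions.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(0)  # arbitrary seed, only for repeatability

sample_skew = {}
theoretical_skew = {}
for k in (2, 5, 50):
    sample = np.random.chisquare(k, size=100_000)
    sample_skew[k] = stats.skew(sample)
    theoretical_skew[k] = np.sqrt(8 / k)
    # The two numbers should roughly agree for each k.
    print(k, round(sample_skew[k], 2), round(theoretical_skew[k], 2))
```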

Hypothesis testing

A hypothesis is a statement we can test. A common example is A/B testing, which is very popular on the web.

Alternative hypothesis: Our idea, eg that there is a difference between the groups

Null hypothesis: The opposite of our idea, ie that there is no difference between the groups

Critical value alpha: Significance level

  • The probability of wrongly rejecting a true null hypothesis that you're willing to accept
  • Social sciences typically use 0.1, 0.05, or 0.01; physics can use 10^-5
The SciPy library provides t-tests for checking whether the means of two samples differ significantly
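from scipy import stats
# In IPython/Jupyter, a trailing ? shows the function's documentation
stats.ttest_ind?
# early and late hold the two groups being compared (defined elsewhere)
stats.ttest_ind(early['ass_results'], late['ass_results'])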

>>> Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
Note: the p-value above is greater than alpha (0.05), so we cannot reject the null hypothesis. In other words, there is no statistically significant difference between the two samples.
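
Since the data for the example above is not included in these notes, here is a self-contained sketch using synthetic groups. The group names, means, spreads and sizes are all made-up assumptions.

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # arbitrary seed, only for repeatability

# Hypothetical scores: group_a and group_b share a mean, group_c is shifted.
group_a = np.random.normal(loc=70, scale=10, size=200)
group_b = np.random.normal(loc=70, scale=10, size=200)
group_c = np.random.normal(loc=75, scale=10, size=200)

alpha = 0.05
same = stats.ttest_ind(group_a, group_b)
shifted = stats.ttest_ind(group_a, group_c)

# A p-value below alpha means we reject the null hypothesis of equal means.
print("a vs b:", same.pvalue, same.pvalue < alpha)        # usually not significant
print("a vs c:", shifted.pvalue, shifted.pvalue < alpha)  # very likely significant
```

Even for two groups with the same true mean, the test comes out significant about 5% of the time at alpha = 0.05.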

P-hacking/Dredging: The more t-tests you run, the more likely it is that at least one comes out significant purely by chance. Workarounds:

  • Bonferroni correction - Tighten the alpha value based on the number of tests that are run
    • Eg: If 3 tests are run at a=0.05, each test must be significant at 0.05/3 = 0.017 (conservative)
  • Hold out tests
  • Investigation pre-registration
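
The dredging problem and the Bonferroni correction can be sketched by running many t-tests on data where the null hypothesis is true by construction. The number of tests, the sample sizes and the seed are assumptions picked for the example.

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # arbitrary seed, only for repeatability
alpha = 0.05
n_tests = 100

# Every pair comes from the same distribution, so the null hypothesis is true
# for all of them and any "significant" result is a false positive.
p_values = np.array([
    stats.ttest_ind(np.random.normal(size=50), np.random.normal(size=50)).pvalue
    for _ in range(n_tests)
])

uncorrected_hits = int(np.sum(p_values < alpha))           # expect about 5
bonferroni_alpha = alpha / n_tests                         # 0.0005 per test
corrected_hits = int(np.sum(p_values < bonferroni_alpha))  # expect about 0

print(uncorrected_hits, corrected_hits)
```

The corrected threshold removes most false positives, at the cost of making real effects harder to detect, which is why the correction is described as conservative.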