# Python for statistical analysis

# Distributions

**Distribution**: the set of all possible values a random variable can take, together with the probability of each

### Binomial Distributions

#### Eg: Coin flip

- Binomial = 2 possible outcomes
- Discrete = outcomes are categories, not real numbers
- Evenly weighted = equal chance of either side (a fair coin has p = 0.5)

```
import numpy as np
# np.random.binomial(n, p, size): n = flips per experiment,
# p = probability of success, size = number of experiments
np.random.binomial(n, p, size=sample_size)
# Coin eg: one flip of a fair coin, returns 0 or 1
np.random.binomial(1, 0.5)
```

### Uniform distribution

The result is a value rather than a category, and every value in the range is equally likely. Intended to graph these:

x = Value of observation

y = Probability an observation will occur
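A minimal sketch of sampling from a uniform distribution with NumPy (the range [0, 1) and the sample size are assumed here for illustration):

```python
import numpy as np

# Draw 1000 samples; every value in [0, 1) is equally likely
uniform_samples = np.random.uniform(0, 1, size=1000)

# The sample mean should land near the middle of the range, 0.5
print(uniform_samples.mean())
```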

### Normal (Gaussian) distribution

Bell curve; the mean is the center value.

__Expected value__ = the mean if the experiment were repeated infinitely

__Mean value__ = the average value of the sample actually taken

__Variance__ = a measure of how broadly values are spread around the mean

Measures of central tendency:

- Mode
- Median
- Mean
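All three measures can be computed directly; a small sketch with made-up data (`Counter` is used for the mode to avoid differences between SciPy versions):

```python
import numpy as np
from collections import Counter

data = np.array([1, 2, 2, 3, 3, 3, 4, 5])

mean_val = np.mean(data)      # average of all values
median_val = np.median(data)  # middle value of the sorted data
# Mode = the most frequent value
mode_val = Counter(data.tolist()).most_common(1)[0][0]
print(mean_val, median_val, mode_val)
```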

#### Standard deviation

How spread out values are from the mean, on average.

```
distribution = np.random.normal(0.75,size=1000)
# Manually calculating std dev
np.sqrt(np.sum((np.mean(distribution)-distribution)**2)/len(distribution))
# Numpy std dev calculation
np.std(distribution)
```

#### Kurtosis

The shape of the tails of the distribution.

```
# Negative = more flat than normal distribution
# Positive = more peaky than normal distribution
import scipy.stats as stats
stats.kurtosis(distribution)
```

#### Chi Squared Distribution

Right-skewed distribution

As the degrees of freedom increase, the curve becomes closer to normal.

```
# More right skewed
chi_squared_df2 = np.random.chisquare(2, size=10000)
stats.skew(chi_squared_df2)
# Less right skewed
chi_squared_df5 = np.random.chisquare(5, size=10000)
stats.skew(chi_squared_df5)
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
output = plt.hist([chi_squared_df2,chi_squared_df5], bins=50, histtype='step',
                  label=['2 degrees of freedom','5 degrees of freedom'])
plt.legend(loc='upper right')
```

# Hypothesis testing

A hypothesis is a statement we can test. A common example is A/B testing, which is very popular on the web.

**Alternative hypothesis**: Our hypothesis

**Null hypothesis**: The opposite of our hypothesis; it states there is no difference between the groups

**Critical value alpha**: Significance level

- How much chance of a false positive you're willing to accept
- Typical values: 0.1, 0.05, or 0.01 in the social sciences; around 10^-5 in physics

```
from scipy import stats
# stats.ttest_ind?  # IPython syntax to view the docstring
stats.ttest_ind(early['ass_results'], late['ass_results'])
# Ttest_indResult(statistic=1.400549944897566, pvalue=0.16148283016060577)
```

**P-hacking/Dredging**: The more t-tests that are performed, the more likely one will eventually appear significant by chance. Workarounds:

- Bonferroni correction: tighten the alpha value based on the number of tests performed
  - Eg: with 3 tests at a = 0.05, each test must reach 0.05/3 ≈ 0.017 to count as significant (conservative)
- Hold-out tests
- Pre-registration of the investigation
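The Bonferroni correction above can be sketched as follows (the p-values are hypothetical, for illustration only):

```python
# Bonferroni correction: divide alpha by the number of tests run
alpha = 0.05
num_tests = 3
corrected_alpha = alpha / num_tests  # about 0.017

# Hypothetical p-values from three separate t-tests
p_values = [0.04, 0.01, 0.20]
significant = [p < corrected_alpha for p in p_values]
print(corrected_alpha, significant)
```

Note that 0.04 would pass the uncorrected alpha of 0.05, but not the corrected threshold.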