<h1>9. Data Analysis Exercise</h1>
<h2>10/23/2023</h2>

<h2>9.0 Last Time...</h2>
<ul>
    <li>The <b>open()</b> function lets you open a file in read, write, or append mode.</li>
    <li>Files should always be closed using the <b>close()</b> method.</li>
    <li>You can read a single line with <b>readline()</b>, and multiple lines with <b>readlines()</b>.</li>
    <li>The <b>write()</b> method writes a single string to the file, and the <b>writelines()</b> method writes a list of strings.</li>
    <li><b>split()</b> lets you break strings based on defined separators.</li>
</ul>

<h2>9.1 The General Idea...</h2>

Today we're going to be working our way through various ways of analyzing datasets. The datasets in question are called <b>data0001.txt</b>, <b>data0002.txt</b>, and <b>data0003.txt</b>.

These datasets are just composed of randomly generated numbers that have particular statistical features. We're going to make use of file I/O techniques to calculate certain statistics for each of them.

<h3>9.1.1 A Quick Review of Statistics</h3>

As a review, the <b>mean</b> is what we typically think of as "average": the sum of all elements, divided by the total number of elements. A mean is a useful summary, but it is <b>sensitive to outliers</b>. As an example, consider the following:

In [1]:
import numpy as np
a = np.array([1,2,3,4,5])
print(np.mean(a))

3.0


In [2]:
# Now consider an outlier: a value that's much higher or lower than the others.
a = np.array([1,2,3,4,155555])
print(np.mean(a))
# the mean is not resistant to outliers


31113.0


If we suspect there are outliers in the data, we can instead use the <b>median</b>, which is not sensitive to outliers. The median simply organizes all the values in ascending order and picks the middle one (or averages the middle two). The median is said to be <b>resistant to outliers</b>.

In [4]:
a = np.array([1,2,3,4,55555])
print(np.median(a))

3.0


Finding the mean or median of a dataset is only part of the story: we're often also interested in how spread out the data are. As a first look, here is the spread of the array from the previous cell (the one that still contains the outlier):

In [5]:
print(np.std(a))

22221.000022501237


Two arrays can have the same mean (and the same median, actually) while the spread of their values is very different. To capture this, we use measures of spread such as the <b>standard deviation</b>, which is essentially a measure of the typical distance between each value and the mean.
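
To make the comparison concrete, here is a minimal sketch with two made-up arrays. The names <b>first</b> and <b>second</b> and their values are purely illustrative and are not part of the course datasets. Both arrays share a mean of 5 and a median of 5, but their standard deviations differ.

```python
import numpy as np

# Two illustrative arrays (made-up values) with the same mean and median.
first = np.array([4, 5, 5, 5, 6])
second = np.array([1, 3, 5, 7, 9])

# Mean, median, and standard deviation of each array.
print(np.mean(first), np.median(first), np.std(first))
print(np.mean(second), np.median(second), np.std(second))
```

The first two numbers on each line match, while the third (the standard deviation) is larger for the second array.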

So we can tell that the second array is more spread out than the first.

Okay, but how does the standard deviation do when it comes to outliers?
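
One way to check is to compute the standard deviation of the same array with and without an outlier. The sketch below reuses the values from the earlier cells; the names <b>clean</b> and <b>with_outlier</b> are just illustrative.

```python
import numpy as np

# The same five values, once as-is and once with the last value replaced by an outlier.
clean = np.array([1, 2, 3, 4, 5])
with_outlier = np.array([1, 2, 3, 4, 55555])

print(np.std(clean))         # about 1.41
print(np.std(with_outlier))  # about 22221.0
```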

There's a big difference there! Just as the median can stand in for the mean when we suspect there are outliers, we can also use an outlier-resistant measure of spread called the <b>interquartile range (IQR)</b>.

After sorting the data in ascending order, a <b>quartile</b> corresponds to a quarter of the data. Counting upward through the data, once we've reached 1/4 of the data, we've reached the first quartile. The second quartile is when we've reached 1/2 of the data (so the <b>second quartile is equal to the median</b>). The third quartile is when we've reached 3/4 of the data.

The interquartile range is simply the difference between the 3rd quartile and the 1st quartile. This function isn't in NumPy (yet!), but it is in scipy.stats.

In [6]:
import scipy.stats as S
import numpy as np

a = np.array([1,2,3,4,5,6,7])
b = np.array([1,2,3,4,5,6,1552345])

print(S.iqr(a))
print(S.iqr(b))

3.0
3.0
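
The same number can also be sketched directly from the quartiles, using NumPy's <b>percentile()</b> function to find the 25% and 75% points. The helper name <b>iqr_by_hand</b> is hypothetical and only meant to show where the value comes from.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([1, 2, 3, 4, 5, 6, 1552345])

def iqr_by_hand(values):
    # Find the first and third quartiles (25th and 75th percentiles)...
    q1, q3 = np.percentile(values, [25, 75])
    # ...and return the distance between them.
    return q3 - q1

print(iqr_by_hand(a))
print(iqr_by_hand(b))
```

Both calls give 3.0: the outlier sits above the third quartile, so it never enters the calculation.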


If you want to get fancy with your analysis, there's also <b>skewness</b> and <b>kurtosis</b>, but they're harder to puzzle out by hand.

<b>Skewness</b> is a measure of how asymmetrical your distribution is: negative skew means a plot of the data has a longer left tail, whereas positive skew means a plot of the data has a longer right tail.

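<b>Kurtosis</b> is a measure of how sharp the peak is in a distribution, as compared to a Gaussian (bell-curve). Under the classic (Pearson) definition, a Gaussian has a kurtosis of 3: a value greater than 3 means a sharper peak than a Gaussian, and a value less than 3 means a more gradual one. Be careful, though: by default, <b>scipy.stats.kurtosis()</b> reports the <b>excess kurtosis</b> (the Fisher definition), which subtracts 3 so that a Gaussian scores 0. With that default, compare the result against 0 rather than 3.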

These are more complicated statistics, but you may come across the names, and they can come in handy when you're doing data analysis!

In [7]:
print(S.skew(a))
print(S.skew(b))

print(S.kurtosis(a))
print(S.kurtosis(b))


0.0
2.0412414522829976
-1.25
2.1666666665875907
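
The kurtosis values above come from scipy's default settings. As a small sketch of the two conventions, the <b>fisher</b> argument of <b>kurtosis()</b> switches between excess kurtosis (Gaussian = 0) and the Pearson definition (Gaussian = 3); the variable names here are illustrative.

```python
import numpy as np
import scipy.stats as S

a = np.array([1, 2, 3, 4, 5, 6, 7])

# Default: Fisher definition (excess kurtosis), where a Gaussian scores 0.
excess = S.kurtosis(a)
# Pearson definition, where a Gaussian scores 3.
pearson = S.kurtosis(a, fisher=False)

print(excess, pearson)
```

The two results always differ by exactly 3.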


<h2>9.2 A Traditional Approach</h2>

We're going to call this a traditional approach because it's the sort of thing you could do in just about any programming language; it doesn't really take advantage of the power of Python.

Our goal is to calculate the mean, median, standard deviation, IQR, skewness, and kurtosis of each of the 3 datasets!

In [None]:
# Start by importing the relevant packages.
import numpy as np
import scipy.stats as s

# Let's create a function that will read data from any file.
# The function has one argument: the name of the file.
def read(file):
    # Start by defining a file object. We're opening in read-only mode.
    fileobj = open(file, "r")
    # Next, use readlines() to create a list containing every line of the file.
    outputstr = fileobj.readlines()
    # Close the file!
    fileobj.close()
    # Let's initialize an array that will contain all the individual values from the file.
    outputarray = np.zeros(len(outputstr))
    # Finally, let's loop over all the lines and put their values into this new array.
    for i in np.arange(len(outputstr)):
        outputarray[i] = float(outputstr[i])
    # We now have a function that takes in a file name and puts
    # all the data into an array!
    # The final step is to return the data array.
    return outputarray


# Okay, so let's make use of this function for our three datasets.
data1 = read("../datasets/data001.txt")
data2 = read("../datasets/data002.txt")
data3 = read("../datasets/data003.txt")

# Calculate the stats!
mean1 = np.mean(data1)
mean2 = np.mean(data2)
mean3 = np.mean(data3)


# Printing:


<h2>9.3 Array Storage</h2>

We can do better than that! Let's make use of arrays for the results.

In [None]:
# Import the necessary packages.
import numpy as np
import scipy.stats as s

# Let's initialize arrays of our final values!
numfiles = 3
mean = np.zeros(numfiles)
median = np.zeros(numfiles)
std = np.zeros(numfiles)

# Now, let's use a loop to calculate the values!
    # We can use the index from the loop to name each file!

    # Now, just use read() to grab all the data from the file.

    # Calculate your statistics!


# Once the loop is complete, print out the arrays.
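
One possible way to fill in the outline above is sketched here. So that it runs anywhere, it writes three tiny stand-in files to a temporary folder first; the folder, the file contents, and the zero-padded naming pattern are all assumptions made for illustration, not the real course datasets. The <b>read()</b> function is a compact variant of the one from 9.2.

```python
import os
import tempfile
import numpy as np

def read(file):
    # Compact variant of read(): the with-block closes the file for us.
    with open(file, "r") as fileobj:
        return np.array([float(line) for line in fileobj])

# Stand-in data (made-up values) written to a temporary folder.
folder = tempfile.mkdtemp()
demo_values = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [1, 2, 3, 4, 55555]]
for n, values in enumerate(demo_values):
    with open(os.path.join(folder, f"data{n + 1:03d}.txt"), "w") as fileobj:
        fileobj.writelines(str(v) + "\n" for v in values)

# Initialize arrays of our final values.
numfiles = 3
mean = np.zeros(numfiles)
median = np.zeros(numfiles)
std = np.zeros(numfiles)

for n in range(numfiles):
    # Use the loop index to build each file name: data001.txt, data002.txt, ...
    filename = os.path.join(folder, f"data{n + 1:03d}.txt")
    data = read(filename)
    # Store each statistic at the position that matches the file.
    mean[n] = np.mean(data)
    median[n] = np.median(data)
    std[n] = np.std(data)

# Once the loop is complete, print out the arrays.
print(mean)
print(median)
print(std)
```

The same pattern extends to the IQR, skewness, and kurtosis by adding one array and one line in the loop for each.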


That worked fairly well, but the big concern here is that sometimes your files won't be as nicely numbered as they are here.

<h2>9.4 Dictionary Storage</h2>

How can we use dictionaries to our advantage? This might solve our problem with our filenames! Instead of relying on them to be a perfectly numbered list, we can use them as keys in a dictionary.

And there's a new module we can import, <b>glob</b>, that will grab all the file names for us!

In [None]:
# The usual suspects.
import numpy as np
import scipy.stats as s
# And a new friend!
import glob

# Let's start by getting a list of files in the directory.
# We don't want to grab EVERYTHING, so we'll say it has to start with the word 'data' and end with '.txt'.
filelist = glob.glob("../datasets/data*.txt")
filelist.sort()

# Now initialize our dictionaries as empty to begin with.
mean = {}
median = {}
stddev = {}
iqr = {}
skewness = {}
kurtosis = {}

# Loop through all files.
for i in filelist:
    # Read the data.
    data = read(i)
    # Assign key-value pairs!
 
    
# And, outside the loop, print the results.
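
A minimal sketch of how the key-value step might look is given below. To stay self-contained it creates two irregularly named stand-in files in a temporary folder; those names and values are made up for illustration. Only three of the six dictionaries are shown, and the rest would follow the same pattern.

```python
import glob
import os
import tempfile
import numpy as np
import scipy.stats as s

def read(file):
    # Compact variant of read(): the with-block closes the file for us.
    with open(file, "r") as fileobj:
        return np.array([float(line) for line in fileobj])

# Stand-in files (made-up names and values) that are NOT neatly numbered.
folder = tempfile.mkdtemp()
demo_files = {"data_alpha.txt": [1, 2, 3, 4, 5],
              "data_beta.txt": [1, 2, 3, 4, 55555]}
for name, values in demo_files.items():
    with open(os.path.join(folder, name), "w") as fileobj:
        fileobj.writelines(str(v) + "\n" for v in values)

# Grab every file that starts with 'data' and ends with '.txt'.
filelist = glob.glob(os.path.join(folder, "data*.txt"))
filelist.sort()

mean = {}
median = {}
iqr = {}

for i in filelist:
    data = read(i)
    # The file name itself is the key; the statistic is the value.
    mean[i] = np.mean(data)
    median[i] = np.median(data)
    iqr[i] = s.iqr(data)

# Outside the loop, print the results.
for i in filelist:
    print(i, mean[i], median[i], iqr[i])
```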


<h2>9.5 MORE Dictionary Storage</h2>

Okay, well, what if we didn't want to have to make a separate dictionary for every statistical metric? Remember, the key:value pairs in dictionaries are very flexible and actually allow you to put dictionaries themselves into the values!

In [None]:
# Old friends, back again.


# First, create a dictionary of metrics with the commands you'll need to calculate them.

# And we get our files the usual way.

# Now let's initialize a results dictionary for each metric.

# Now loop through all files, storing the relevant metrics!
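
The outline above could be filled in along these lines. This is a sketch under stated assumptions: the stand-in files, their contents, and the names <b>metrics</b> and <b>results</b> are illustrative choices. The key idea is that functions can be stored as dictionary values and called later.

```python
import glob
import os
import tempfile
import numpy as np
import scipy.stats as s

def read(file):
    # Compact variant of read(): the with-block closes the file for us.
    with open(file, "r") as fileobj:
        return np.array([float(line) for line in fileobj])

# Stand-in files (made-up names and values) in a temporary folder.
folder = tempfile.mkdtemp()
demo_files = {"data_alpha.txt": [1, 2, 3, 4, 5],
              "data_beta.txt": [1, 2, 3, 4, 55555]}
for name, values in demo_files.items():
    with open(os.path.join(folder, name), "w") as fileobj:
        fileobj.writelines(str(v) + "\n" for v in values)

# A dictionary of metrics: each key is a name, each value is the function to call.
metrics = {"mean": np.mean, "median": np.median, "stddev": np.std,
           "iqr": s.iqr, "skewness": s.skew, "kurtosis": s.kurtosis}

filelist = glob.glob(os.path.join(folder, "data*.txt"))
filelist.sort()

# One (initially empty) results dictionary per metric, stored inside results.
results = {}
for name in metrics:
    results[name] = {}

# Loop through all files, then through all metrics, storing each value.
for i in filelist:
    data = read(i)
    for name, function in metrics.items():
        results[name][i] = function(data)

for name in metrics:
    print(name, results[name])
```

Adding another metric is then a matter of adding one more entry to <b>metrics</b>.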


The power of this approach is that we can change almost anything very easily: adding or removing files, adding or removing metrics, it's all done with one or two lines of code at most. The first version we saw would have been <b>much</b> more complicated!

<h2>9.6 Take-Home Points</h2>
<ul>
    <li>Statistics that are not resistant to outliers include the mean, the standard deviation, skewness, and kurtosis.</li>
    <li>Statistics that are resistant to outliers include the median and the interquartile range.</li>
    <li>By making use of dictionaries, we can create versatile, non-hard-coded programs!</li>
    <li><b>glob</b> is a module that enables us to grab all files in a directory whose names match a pattern.</li>
</ul>