<h1>15. Missing Values</h1>
<h2>11/13/2023</h2>

<h2>15.0 Last Time...</h2>
<ul>
    <li><b>Pandas</b> is a useful way of working with CSV data!</li>
    <li>A <b>dataframe</b> is an object that contains rows and columns, much like an Excel spreadsheet.</li>
    <li><b>loc[]</b> lets you select individual rows, columns, or values by label (note the square brackets: it is an indexer, not a function).</li>
    <li><b>describe()</b> summarizes statistics for a specified section of a dataframe.</li>
    <li><b>read_csv()</b> will read in a CSV file specified by a file location.</li>
    <li><b>groupby()</b> splits a dataframe into groups so that an operation (such as a mean or a sum) can be applied to each group separately.</li>
</ul>

<h2>15.1 Masked Arrays</h2>

<b>Masked</b> arrays are just like normal arrays, except that they have a "mask" attribute to tell you which elements are bad.

Recall how arrays normally work:

In [2]:
# Let's create a 2D array that contains the numbers 1-6.
import numpy as np
import pandas as pd

a = np.array([[1,2,3],[4,5,6]])
print(a)


[[1 2 3]
 [4 5 6]]


Suppose we suspect that the last two values are bad data. We can create a <b>mask</b> that travels with the array and flags those values. Elements whose mask value is True are treated as if they did not exist, and operations on the array automatically take the mask into account.

This is extremely useful! Sometimes we have a dataset that's read-only, or we want to be aware of precisely which data are suspect, so instead of deleting them, we just keep all information and have a flag on which values are bad.

For this purpose, NumPy provides a submodule called <b>numpy.ma</b>.

In [7]:
import numpy.ma as ma
# Importing the submodule as 'ma' saves us from typing 'np.ma.' every time we use it.

a = np.array([[1,2,3],[4,5,6]])
# Let's mask everything greater than 4.
b = ma.masked_greater(a,4)

print(b)
print(b.mask)
print(b.data)

[[1 2 3]
 [4 -- --]]
[[False False False]
 [False  True  True]]
[[1 2 3]
 [4 5 6]]


In [8]:
# Now, if we try to do an operation on our masked array:
print(b*3)


[[3 6 9]
 [12 -- --]]


When we operate on a masked array, every element whose mask value is True stays masked in the result. Masked arrays thus deal with missing data transparently.
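The same behavior carries over to reductions such as sums and means. As a minimal sketch (reusing the array from above; the variable names n_valid, total, mean_masked, and mean_plain are chosen only for illustration), the masked elements are left out of the calculation:

```python
import numpy as np
import numpy.ma as ma

a = np.array([[1, 2, 3], [4, 5, 6]])
b = ma.masked_greater(a, 4)   # masks 5 and 6

# Reductions on a masked array skip the masked elements.
n_valid = b.count()           # number of unmasked elements: 4
total = b.sum()               # 1 + 2 + 3 + 4 = 10
mean_masked = b.mean()        # mean of the unmasked values only: 2.5
mean_plain = a.mean()         # mean of all six values: 3.5

print(n_valid, total, mean_masked, mean_plain)
```

Comparing mean_masked with mean_plain is a quick way to see how much the suspect values were influencing a statistic.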

<h2>15.2 Constructing and Deconstructing Masked Arrays</h2>

There are several different ways to construct a masked array; we saw one example above, but (as always!) Python provides us with options.

We can explicitly specify a mask!

In [10]:
a = ma.masked_array(data=[1,2,3],mask=[True,True,False])
print(a)
print(a.data)
print(a.mask)

[-- -- 3]
[1 2 3]
[ True  True False]


A lot of the time, we'll determine whether or not data values should be masked on the basis of some logical test (e.g., whether data values are beyond an acceptable value - like negative rainfall amounts!).

We can make a masked array by masking values based on conditions! This can be done with some specific functions like <b>numpy.ma.masked_greater()</b> and <b>numpy.ma.masked_where()</b>.

In [13]:
# Mask all values greater than 3.
data = np.array([1,2,3,4,5])
a = ma.masked_greater(data,3)
print(a)


[1 2 3 -- --]


In [14]:
# Mask all values greater than 2 and less than 5.
b = ma.masked_where(np.logical_and(data>2,data<5),data)
print(b)


[1 2 -- -- 5]
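To connect this back to the rainfall example, here is a small sketch using made-up rainfall totals (the values and variable names are hypothetical). It uses <b>numpy.ma.masked_less()</b> and <b>numpy.ma.masked_invalid()</b>, two more condition-based functions from numpy.ma, alongside masked_where():

```python
import numpy as np
import numpy.ma as ma

# Hypothetical daily rainfall totals in mm; -9.9 and NaN stand in for bad readings.
rain = np.array([0.0, 12.5, -9.9, 3.2, np.nan])

# Mask physically impossible (negative) amounts.
rain_nonneg = ma.masked_less(rain, 0)

# Mask NaN and infinite entries.
rain_finite = ma.masked_invalid(rain)

# Combine both tests in a single condition with masked_where().
rain_clean = ma.masked_where(np.logical_or(rain < 0, np.isnan(rain)), rain)

print(rain_clean)
print(rain_clean.mask)
```

The dedicated functions are convenient for a single test, while masked_where() is the more general tool when several tests need to be combined.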


Sometimes we might want to export our results to a file that doesn't support object attributes (for example, a text or comma-separated value file). In those cases, it makes sense to replace masked values with some value that we know is nonsense, which we can do using <b>numpy.ma.filled()</b>. By default, it uses the array's own fill_value.

In [17]:
c = ma.masked_array(data=[1.,2.,3.],mask=[True,True,False],fill_value=-1e+23)
print(c)

d = ma.filled(c)
print(d)

[-- -- 3.0]
[-1.e+23 -1.e+23  3.e+00]
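The fill value can also be passed directly when filling, and a mask can be rebuilt later from the same nonsense value. The sketch below assumes -999.0 as the sentinel (an arbitrary choice for this example) and uses <b>numpy.ma.masked_values()</b> to recover the masked array:

```python
import numpy as np
import numpy.ma as ma

c = ma.masked_array(data=[1., 2., 3.], mask=[True, True, False])

# Pass the fill value directly instead of storing it on the array.
# -999.0 is an arbitrary sentinel chosen for this example.
d = c.filled(-999.0)          # plain ndarray: masked slots become -999.0

# Later (for example, after reading the values back from a file),
# rebuild the mask by flagging the sentinel value.
e = ma.masked_values(d, -999.0)

print(d)
print(e)
```

Pick a sentinel that can never occur as a real measurement; otherwise valid data would be masked when the array is rebuilt.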


<h2>15.3 An Example</h2>

As an example, let's revisit the <b>air.mon.mean.nc</b> NetCDF file from before. This dataset contains global air temperature in degrees Celsius. Let's look at the first time slice and mask out temperatures at all locations poleward of 45N and 45S (latitudes greater than 45 or less than -45), then convert the remaining temperatures to kelvin (K = C + 273.15).

In [20]:
# First, import the important packages.
import scipy.io as sc


# Open the file in read-only mode.
fileobj = sc.netcdf_file("../datasets/air.mon.mean.nc",mode="r")

# Create three variables: temp, lat, and lon.
# Remember, we only want the first time step!
temp = fileobj.variables["air"][0,:,:]
lat = fileobj.variables["lat"][:]
lon = fileobj.variables["lon"][:]

# Use meshgrid() to create a lat-lon grid.

[lonall,latall] = np.meshgrid(lon,lat)
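
If meshgrid() feels abstract, a tiny grid helps. The sketch below uses a hypothetical 3-by-4 grid (the names lat_demo, lon_demo, lat2d, and lon2d are made up for illustration) to show that each output has one row per latitude and one column per longitude, which is what lets us compare latitudes against a 2D temperature field element by element:

```python
import numpy as np

# Hypothetical coarse grid: 3 latitudes and 4 longitudes.
lat_demo = np.array([60.0, 0.0, -60.0])
lon_demo = np.array([0.0, 90.0, 180.0, 270.0])

lon2d, lat2d = np.meshgrid(lon_demo, lat_demo)

print(lat2d.shape)   # one row per latitude, one column per longitude
print(lat2d)         # each row repeats a single latitude
print(lon2d)         # each column repeats a single longitude
```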


<b>1. With the above code to get you started, create a masked array called ma_temp that masks all latitudes greater than 45 or less than -45.</b>

In [27]:
# Mask every grid point poleward of 45 degrees in either hemisphere.
ma_temp = ma.masked_where(np.logical_or(latall>45,latall<-45),temp)
print(ma_temp)

[[-- -- -- ... -- -- --]
 [-- -- -- ... -- -- --]
 [-- -- -- ... -- -- --]
 ...
 [-- -- -- ... -- -- --]
 [-- -- -- ... -- -- --]
 [-- -- -- ... -- -- --]]


<b>2. Next, convert all temperatures in the unmasked region (between 45N and 45S) to kelvin.</b>

In [32]:
# Masked elements stay masked; only the unmasked values are converted.
ma_temp = ma_temp+273.15
print(ma_temp)

[[-- -- -- ... -- -- --]
 [-- -- -- ... -- -- --]
 [-- -- -- ... -- -- --]
 ...
 [-- -- -- ... -- -- --]
 [-- -- -- ... -- -- --]
 [-- -- -- ... -- -- --]]


You can check the results with the following code:

In [33]:
print('North pole: ',ma_temp[0,:])
print('South pole: ',ma_temp[-1,:])
print('Equator: ',ma_temp[36,:])

North pole:  [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --]
South pole:  [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --]
Equator:  [298.989990234375 298.8916015625 298.92547607421875 299.78387451171875
 297.4780578613
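
If you don't have the NetCDF file handy, the same two steps can be tried on synthetic data. This is only a sketch: the 2.5-degree grid and the constant 25 degrees Celsius field are assumptions made up for the example, and none of the names below come from the dataset.

```python
import numpy as np
import numpy.ma as ma

# Synthetic stand-in for the NetCDF data: latitudes from 90 to -90 in
# 2.5-degree steps and a constant 25 degrees C everywhere (made-up values).
lat_demo = np.arange(90.0, -92.5, -2.5)
lon_demo = np.arange(0.0, 360.0, 2.5)
temp_demo = np.full((lat_demo.size, lon_demo.size), 25.0)

lon2d, lat2d = np.meshgrid(lon_demo, lat_demo)

# Mask everything poleward of 45 degrees, then convert Celsius to kelvin.
ma_demo = ma.masked_where(np.abs(lat2d) > 45, temp_demo) + 273.15

i_eq = int(np.argmin(np.abs(lat_demo)))   # row index closest to the equator
print(ma_demo[0, :].count())      # unmasked points in the northernmost row
print(ma_demo[i_eq, :].count())   # unmasked points in the equator row
```

Using np.abs(lat2d) > 45 is an equivalent, slightly shorter way of writing the logical_or() test on the two hemispheres.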

<h2>15.4 Take-Home Points</h2>
<ul>
    <li>A masked array has a <b>mask</b> attribute that allows us to identify suspicious or unwanted data.</li>
    <li>We can create a masked array by specifying the mask directly or by masking on a condition, and we can use filling to convert a masked array back to a regular array.</li>
</ul>