{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>15. Missing Values</h1>\n",
"<h2>11/13/2023</h2>\n",
"\n",
"<h2>15.0 Last Time...</h2>\n",
"<ul>\n",
" <li><b>Pandas</b> is a useful way of working with CSV data!</li>\n",
" <li>A <b>dataframe</b> is an object that contains rows and columns, much like an Excel spreadsheet.</li>\n",
" <li><b>loc[]</b> will let you identify individual rows, columns, or values (note the square brackets: it is an indexer, not a function).</li>\n",
" <li><b>describe()</b> summarizes statistics for a specified section of a dataframe.</li>\n",
" <li><b>read_csv()</b> will read in a CSV file specified by a file location.</li>\n",
" <li><b>groupby()</b> carries out specific operations on groupings within a dataframe.</li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>15.1 Masked Arrays</h2>\n",
"\n",
"<b>Masked</b> arrays are just like normal arrays, except that they have a \"mask\" attribute to tell you which elements are bad.\n",
"\n",
"Recall how arrays normally work:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1 2 3]\n",
" [4 5 6]]\n"
]
}
],
"source": [
"# Let's create a 2D array that contains the numbers 1-6.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"a = np.array([[1,2,3],[4,5,6]])\n",
"print(a)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we suspect that some values (say, the last two) are bad data, we can create a <b>mask</b> of bad values that travels with the array. Elements whose mask value is True (\"bad\") are treated as if they did not exist, and operations on the array automatically take the mask into account.\n",
"\n",
"This is extremely useful! Sometimes we have a dataset that's read-only, or we want to be aware of precisely which data are suspect, so instead of deleting them, we just keep all information and have a flag on which values are bad.\n",
"\n",
"For this purpose, NumPy has a module called <b>numpy.ma</b>."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1 2 3]\n",
" [4 -- --]]\n",
"[[False False False]\n",
" [False True True]]\n",
"[[1 2 3]\n",
" [4 5 6]]\n"
]
}
],
"source": [
"import numpy.ma as ma\n",
"# Importing the module as 'ma' saves us from typing 'np.ma.' every time.\n",
"\n",
"a = np.array([[1,2,3],[4,5,6]])\n",
"# Let's set our mask to everything greater than 4.\n",
"b = ma.masked_greater(a,4)\n",
"\n",
"print(b)       # the masked array\n",
"print(b.mask)  # True marks a masked (bad) element\n",
"print(b.data)  # the underlying data are unchanged"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[3 6 9]\n",
" [12 -- --]]\n"
]
}
],
"source": [
"# Now, if we try to do an operation on our masked array:\n",
"print(b*3)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we have a masked array, any operations applied to elements whose mask value is set to True will create a resulting array that also has the corresponding elements' mask values set to True. Masked arrays thus transparently deal with missing data."
]
},
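{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same idea applies to reductions. As a minimal sketch (the names <b>raw</b> and <b>masked</b> are illustrative and not used elsewhere in this notebook), the cell below compares <b>sum()</b>, <b>mean()</b>, and <b>count()</b> on a masked array with the same reductions on the underlying data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import numpy.ma as ma\n",
"\n",
"# Illustrative names (not used elsewhere in this notebook).\n",
"raw = np.array([[1,2,3],[4,5,6]])\n",
"masked = ma.masked_greater(raw,4)   # masks the 5 and the 6\n",
"\n",
"# Reductions on the masked array skip the masked elements.\n",
"print(masked.sum())     # adds only 1, 2, 3, and 4\n",
"print(masked.mean())    # average of the four unmasked values\n",
"print(masked.count())   # number of unmasked elements\n",
"\n",
"# The same reductions on the raw array use all six values.\n",
"print(raw.sum())\n",
"print(raw.mean())"
]
},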
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>15.2 Constructing and Deconstructing Masked Arrays</h2>\n",
"\n",
"There are several different ways to construct a masked array; we saw one example above, but (as always!) Python provides us with options.\n",
"\n",
"We can explicitly specify a mask!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-- -- 3]\n",
"[1 2 3]\n",
"[ True True False]\n"
]
}
],
"source": [
"a = ma.masked_array(data=[1,2,3],mask=[True,True,False])\n",
"print(a)\n",
"print(a.data)\n",
"print(a.mask)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A lot of the time, we'll determine whether or not data values should be masked on the basis of some logical test (e.g., whether data values are beyond an acceptable value - like negative rainfall amounts!).\n",
"\n",
"We can make a masked array by masking values based on conditions! This can be done with some specific functions like <b>numpy.ma.masked_greater()</b> and <b>numpy.ma.masked_where()</b>."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1 2 3 -- --]\n"
]
}
],
"source": [
"# Mask all values greater than 3.\n",
"data = np.array([1,2,3,4,5])\n",
"a = ma.masked_greater(data,3)\n",
"print(a)\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1 2 -- -- 5]\n"
]
}
],
"source": [
"# Mask all values greater than 2 and less than 5.\n",
"b = ma.masked_where(np.logical_and(data>2,data<5),data)\n",
"print(b)\n"
]
},
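{
"cell_type": "markdown",
"metadata": {},
"source": [
"Returning to the negative-rainfall case mentioned above, here is a small sketch using <b>numpy.ma.masked_less()</b>. The rainfall numbers and the -9.9 bad-reading value are made up for illustration; they do not come from any dataset used in this course."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import numpy.ma as ma\n",
"\n",
"# Hypothetical daily rainfall totals in mm; -9.9 stands in for a bad reading.\n",
"rain = np.array([0.0, 2.5, -9.9, 12.1, -9.9, 4.0])\n",
"\n",
"# Mask every physically impossible (negative) value.\n",
"rain_ok = ma.masked_less(rain,0)\n",
"print(rain_ok)\n",
"\n",
"# An equivalent mask written as an explicit condition.\n",
"rain_ok2 = ma.masked_where(rain<0,rain)\n",
"print(np.array_equal(rain_ok.mask,rain_ok2.mask))\n",
"\n",
"# Statistics now ignore the bad readings.\n",
"print(rain_ok.mean())"
]
},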
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes we might want to export our results to a file that doesn't support object attributes (for example, a text or comma-separated value file). In those cases, it makes sense to replace masked values with some value that we know is nonsense, which we can do using <b>numpy.ma.filled()</b>."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-- -- 3.0]\n",
"[-1.e+23 -1.e+23 3.e+00]\n"
]
}
],
"source": [
"c = ma.masked_array(data=[1.,2.,3.],mask=[True,True,False],fill_value=-1e+23)\n",
"print(c)\n",
"\n",
"d = ma.filled(c)\n",
"print(d)"
]
},
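{
"cell_type": "markdown",
"metadata": {},
"source": [
"Going the other way is also possible. The sketch below fills masked elements with an explicit placeholder and then rebuilds the mask from that placeholder with <b>numpy.ma.masked_values()</b>. The placeholder -999. and the variable names are arbitrary choices for illustration; in practice, pick a value that can never occur in your real data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import numpy.ma as ma\n",
"\n",
"vals = ma.masked_array(data=[1.,2.,3.],mask=[True,True,False])\n",
"\n",
"# Replace masked elements with an explicit placeholder (-999. is an arbitrary choice).\n",
"plain = vals.filled(-999.)\n",
"print(type(plain),plain)   # a regular NumPy array with no mask\n",
"\n",
"# Later (for example, after reading the values back from a text file),\n",
"# rebuild the mask by flagging every element equal to the placeholder.\n",
"restored = ma.masked_values(plain,-999.)\n",
"print(restored)\n",
"print(restored.mask)"
]
},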
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>15.3 An Example</h2>\n",
"\n",
"As an example, let's revisit the <b>air.mon.mean.nc</b> NetCDF file from before. This dataset contains global air temperature in degrees Celsius. Let's look at the first time slice and mask out temperatures at all locations poleward of 45°N or 45°S (latitude greater than 45 or less than -45), then convert the remaining temperatures to kelvins (K = C + 273.15)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# First, import the important packages.\n",
"import scipy.io as sc\n",
"\n",
"\n",
"# Open the file in read-only mode.\n",
"fileobj = sc.netcdf_file(\"../datasets/air.mon.mean.nc\",mode=\"r\")\n",
"\n",
"# Create three variables: temp, lat, and lon.\n",
"# Remember, we only want the first time step!\n",
"temp = fileobj.variables[\"air\"][0,:,:]\n",
"lat = fileobj.variables[\"lat\"][:]\n",
"lon = fileobj.variables[\"lon\"][:]\n",
"\n",
"# Use meshgrid() to create a lat-lon grid.\n",
"# Both output grids have shape (len(lat), len(lon)), the same as temp.\n",
"lonall,latall = np.meshgrid(lon,lat)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>1. With the above code to get you started, create a masked array called ma_temp that masks all latitudes greater than 45 or less than -45.</b>"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-- -- -- ... -- -- --]\n",
" [-- -- -- ... -- -- --]\n",
" [-- -- -- ... -- -- --]\n",
" ...\n",
" [-- -- -- ... -- -- --]\n",
" [-- -- -- ... -- -- --]\n",
" [-- -- -- ... -- -- --]]\n"
]
}
],
"source": [
"ma_temp = ma.masked_where(np.logical_or(latall>45,latall<-45),temp)\n",
"print(ma_temp)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>2. Next, convert all temperatures in the unmasked region (between 45°S and 45°N) to kelvins.</b>"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-- -- -- ... -- -- --]\n",
" [-- -- -- ... -- -- --]\n",
" [-- -- -- ... -- -- --]\n",
" ...\n",
" [-- -- -- ... -- -- --]\n",
" [-- -- -- ... -- -- --]\n",
" [-- -- -- ... -- -- --]]\n"
]
}
],
"source": [
"ma_temp = ma_temp+273.15\n",
"print(ma_temp)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can check the results with the following code:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"North pole: [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --]\n",
"South pole: [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --\n",
" -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --]\n",
"Equator: [298.989990234375 298.8916015625 298.92547607421875 299.78387451171875\n",
" 297.4780578613281 294.6690368652344 295.40869140625 296.71514892578125\n",
" 296.8448181152344 296.437744140625 295.9793395996094 293.3296813964844\n",
" 292.00128173828125 294.366455078125 293.1861267089844 294.57452392578125\n",
" 301.221923828125 301.32000732421875 298.9042053222656 298.9080505371094\n",
" 298.2835388183594 298.419677734375 298.5970764160156 298.6822509765625\n",
" 298.8238525390625 298.6919250488281 298.71484375 298.830322265625\n",
" 299.1283874511719 298.9522399902344 298.6025695800781 298.3016052246094\n",
" 298.3628845214844 298.5619201660156 298.70159912109375 298.83966064453125\n",
" 298.58514404296875 298.8529052734375 298.94903564453125 298.3051452636719\n",
" 296.6951599121094 296.3338623046875 298.24224853515625 298.41387939453125\n",
" 296.9270935058594 294.44805908203125 294.5732116699219 297.2270812988281\n",
" 296.748046875 297.1932067871094 298.6135559082031 298.6080627441406\n",
" 298.5787048339844 297.7358093261719 298.6219177246094 300.23065185546875\n",
" 300.4396667480469 300.650634765625 300.4158020019531 300.04449462890625\n",
" 300.2264404296875 300.44580078125 300.19580078125 299.89288330078125\n",
" 299.8716125488281 299.91741943359375 299.8864440917969 299.64739990234375\n",
" 299.6064453125 299.6180725097656 299.51739501953125 299.43450927734375\n",
" 299.3219299316406 299.2790222167969 299.062255859375 299.01611328125\n",
" 299.1012878417969 299.0425720214844 299.07806396484375 298.74322509765625\n",
" 298.59063720703125 298.52288818359375 298.23419189453125\n",
" 298.3409729003906 298.168701171875 298.0899963378906 298.0496826171875\n",
" 297.89288330078125 298.1625671386719 297.9141845703125 297.8422546386719\n",
" 297.8786926269531 297.53643798828125 297.52288818359375 297.2799987792969\n",
" 297.0796813964844 297.2138671875 297.080322265625 296.75933837890625\n",
" 296.722900390625 296.6477355957031 296.57708740234375 296.4858093261719\n",
" 296.3219299316406 296.5787048339844 296.8248291015625 297.05352783203125\n",
" 296.7248229980469 297.19903564453125 297.5367736816406 297.294189453125\n",
" 297.6383972167969 293.062255859375 290.7722473144531 295.3493347167969\n",
" 296.72320556640625 296.6377258300781 296.75933837890625\n",
" 296.89935302734375 297.1480712890625 296.15484619140625 296.649658203125\n",
" 296.330322265625 296.31549072265625 297.60162353515625 297.9100036621094\n",
" 298.46514892578125 298.4815979003906 298.3829040527344 298.1735534667969\n",
" 297.87225341796875 298.0712890625 298.0290222167969 297.97064208984375\n",
" 298.06097412109375 298.2054748535156 297.97967529296875 298.169677734375\n",
" 298.4061279296875 298.37774658203125 298.5574035644531 298.3951416015625\n",
" 298.58740234375 298.7112731933594]\n"
]
}
],
"source": [
"print('North pole: ',ma_temp[0,:])\n",
"print('South pole: ',ma_temp[-1,:])\n",
"print('Equator: ',ma_temp[36,:])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>15.4 Take-Home Points</h2>\n",
"<ul>\n",
" <li>A masked array has a <b>mask</b> attribute that allows us to identify suspicious or unwanted data.</li>\n",
" <li>We can build a masked array by specifying a mask directly or by masking on a condition, and use <b>filled()</b> to turn it back into a regular array with a placeholder value.</li>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}