====== Data Analysis ======
===== Console and old-school Tools =====
==== VisiData ====
[[vd]] can extract tabular data from webpages and is the best console-mode CSV data tool around.
* **Metrics for columns**: Use ''I'' to quickly get metrics for numeric columns.
* **Frequency Chart**: ''F''
* **Histograms**: ''O'' to set "numeric binning", then use ''F''
* **Scatter Plot**: ''.''
^ Key ^ Meaning ^
| I | Describe all columns, errors, distinct, mode, mean, median, stdev, etc. |
| i | Add a column of incrementing numbers (useful for '.' charts) |
| . | Requires an "important" numeric column for row data. Make chart. |
| O | Options to enable "numeric_binning" and set number of "histogram_bins" (use 'e') |
| F | Frequency table of row counts, or histogram if numeric_binning is true |
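For a rough sense of what the ''I'' and ''F'' sheets report, here is a sketch that computes similar numbers with Python's standard library. It only illustrates the statistics and is not how VisiData is implemented. The sample values and the ''bin_counts'' helper are made up for this example.

```python
import statistics
from collections import Counter

# Hypothetical sample column (not taken from any dataset on this page)
values = [5, 4, 6, 8, 7, 4]

# Roughly what 'I' (describe) reports for a numeric column
summary = {
    "distinct": len(set(values)),
    "mode": statistics.mode(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# Roughly what 'F' reports: a count of rows per distinct value
frequency = Counter(values)


def bin_counts(data, bins):
    """Count values in equal-width bins, like a simple histogram."""
    lo, hi = min(data), max(data)
    # Fall back to a width of 1 when every value is identical
    width = (hi - lo) / bins or 1
    counts = [0] * bins
    for v in data:
        # The maximum value is placed in the last bin
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


histogram = bin_counts(values, bins=2)

print(summary)
print(frequency.most_common())
print(histogram)
```

With numeric binning, rows are grouped by value range rather than by exact value, which is the difference between the last two lines of output.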
==== GnuPlot ====
GnuPlot is fine for generating text or image plots. Scatter plots, charts, and [[https://gnuplot.sourceforge.net/demo/histograms.html|histograms]] are easy.
Here's an example command, given the following two files, ''data.csv'' and ''gnuplot_instructions.gpi'':
gnuplot -e "f='data.csv'" gnuplot_instructions.gpi
date,col1,col2
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16
And the instructions in ''gnuplot_instructions.gpi'', which you run with the command above:
#
# For ASCII on one full screen
#
set term dumb `tput cols` `tput lines`*9/10
# Or, if you have gnuplot 6.0 and are using Deja Vu font, then...
#set term block braille size `tput cols`,`tput lines`*9/10
#
# For a PNG file.
#
# Launch in WSL: explorer.exe gnuplot.png
# Launch in Linux: gio open gnuplot.png
#
#set term png size 1600,800; set output 'gnuplot.png'
#
# Set your X axis format
#
set timefmt '%Y-%m-%d'
#set timefmt '%Y-%m-%d %H:%M'
set xdata time
set xlabel 'Date'
set xtics "1992-01-01", 2629746 # start, increment in seconds
#set ylabel 'Value'
#
# If you need to control tics or range
#
#set ytic 2
#set yrange [0:9]
#
# sep is whitespace by default. Set to ',' for csv
#
set datafile sep ','
set key autotitle columnhead # use the first line as title
firstrow = system('head -1 '.f. ' | tr "_," "- "')
set xlabel word(firstrow, 1)
set ylabel word(firstrow, 2)
#
# Plot which columns which ways.
#
# lines, points, linespoints
#
#plot f using 1:4 with lines, f using 1:3 with linespoints
#plot f using 1:2 with lines, f using 1:3 with linespoints title 'Legend 2'
plot f using 1:2 with linespoints
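The ''2629746'' in the ''set xtics'' line appears to be the length of an average Gregorian month in seconds. That is my reading of the number, not something the script states. The arithmetic can be checked with a few lines of Python:

```python
# Assumption: the tic increment is one twelfth of an average Gregorian year
DAYS_PER_YEAR = 365.2425
SECONDS_PER_DAY = 24 * 60 * 60

seconds_per_year = DAYS_PER_YEAR * SECONDS_PER_DAY
seconds_per_month = seconds_per_year / 12

# Round to whole seconds for use as a tic increment
print(round(seconds_per_month))
```

Because calendar months differ in length, tics spaced by a fixed number of seconds drift slightly away from the first of each month.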
If you're making a "histogram" (actually a box chart with histogram style on X,Y points)...
# Mostly the same as above, until...
# Set your X axis format
set style histogram clustered gap 1
set style fill solid border -1
# Finally, plot with boxes
plot f using 1:2 with boxes
===== Python Tools =====
There's a hierarchy to the Python data analysis tools. You definitely need the most deeply nested tools, but you may not need the IDE and packager at the root. It goes something like this:
* Anaconda Distribution: (Bigger snake than Python) packages Jupyter and everything you'd need.
* Jupyter: The interactive notebook with embedded images.
* Pandas: Read and sanitize data with read_csv(), to_datetime()
* NumPy: Fundamental, the other projects rely on it.
* Matplotlib: A plotting library for Python and its numerical mathematics extension NumPy.
* Plotly: Generates interactive JavaScript plots.
Here's [[https://github.com/Manishms18/Air-Passengers-Time-Series-Analysis/blob/master/Air_Passenger_with_explanations.ipynb|an example analysis of AirPassengers over time]].
==== Anaconda Distribution ====
**No get**. I think it's called Anaconda because that's literally a bigger snake than Python. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. There's a [[https://www.anaconda.com/code-in-the-cloud|cloud-based trial with a Jupyter notebook]].
Anaconda Navigator lets you launch Jupyter, R, VS Code, etc.
==== Jupyter ====
**No get**. Jupyter Notebook is open-source software and services for interactive computing across multiple programming languages. Interactive embedded charts as you work seem nice. Maybe you can try [[https://jupyter.org/try|the online version]].
==== Pandas ====
**Get this**. Read and sanitize data with read_csv(), to_datetime(). Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. MatPlotLib likes Pandas's dataframes (two dimensional tabular data).
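As a minimal sketch of the ''read_csv()'' and ''to_datetime()'' steps, the following loads the sample ''data.csv'' from the GnuPlot section above. The data is inlined through ''io.StringIO'' only so that the example is self-contained. With a real file you would pass the filename instead.

```python
import io

import pandas as pd

# The sample data.csv from the GnuPlot section, inlined for the example
csv_text = """date,col1,col2
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16
"""

# Read the CSV; the date column arrives as plain strings
data = pd.read_csv(io.StringIO(csv_text))

# Convert the strings to real timestamps, then index the rows by date
data["date"] = pd.to_datetime(data["date"], format="%Y-%m-%d")
data = data.set_index("date")

print(data.dtypes)
print(data.describe())
```

Indexing by the parsed dates is what lets plotting libraries treat the X axis as time instead of as text labels.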
==== NumPy ====
**Get this**. NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, and random simulation.
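A few of those array operations, sketched on the two value columns of the sample ''data.csv'' from the GnuPlot section. The variable names are made up for this example.

```python
import numpy as np

# The two value columns from the sample data.csv
col1 = np.array([5, 4, 6, 8, 7])
col2 = np.array([14, 15, 16, 15.5, 16])

# Mathematical: elementwise addition without a loop
total = col1 + col2

# Logical and selecting: a boolean mask picks the rows where col1 > 5
mask = col1 > 5
selected = col1[mask]

# Sorting and basic statistics
ordered = np.sort(col1)
mean = col1.mean()

# Shape manipulation: stack both columns into one 2-D array
grid = np.vstack([col1, col2])

print(total, selected, ordered, mean, grid.shape)
```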
==== Plotly ====
Undecided whether to use this. See [[https://towardsdatascience.com/matplotlib-vs-plotly-lets-decide-once-and-for-all-dc3eca9aa011|Matplotlib vs. Plotly: Let’s Decide Once and for All]]. Need to experiment. plotly.py is an interactive, open-source, and JavaScript-based graphing library for Python. Built on top of plotly.js, plotly.py is a high-level, declarative charting library that includes over 30 chart types, including scientific charts, 3D graphs, statistical charts, SVG maps, and financial charts.
==== Matplotlib ====
**Get this**. Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
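Here is a small sketch of rendering a chart straight to a PNG file, much like the ''set term png'' option in the GnuPlot section. It assumes the non-interactive ''Agg'' backend so that it runs without a display. The output filename and the month labels are made up for this example.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render to files only; no window is opened
import matplotlib.pyplot as plt

# Values from the sample data.csv; the labels stand in for the date column
months = ["Jan", "Feb", "Mar", "Apr", "May"]
col1 = [5, 4, 6, 8, 7]
col2 = [14, 15, 16, 15.5, 16]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(months, col1, marker="o", label="col1")  # line with point markers
ax.plot(months, col2, label="col2")              # plain line
ax.set_xlabel("Month (1992)")
ax.set_ylabel("Value")
ax.legend()

# Hypothetical output location in the system temp directory
out_path = os.path.join(tempfile.gettempdir(), "matplotlib_sketch.png")
fig.savefig(out_path)
plt.close(fig)

print(out_path)
```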
====== Case Study: Time Series ======
Data: [[https://www.kaggle.com/datasets/rakannimer/air-passengers|AirPassengers.csv]]
===== VisiData =====
vd AirPassengers.csv
^ Key ^ Action ^
| @ | Set Column one as date format |
| ! | Set Column one as "important" for the X-axis |
| l | Navigate to column 2 |
| # | Set data as integer format |
| . | Create scatterplot |
{{:general:visidata_airpassengers.png?400|}}
**Pros**: Super fast and easy.
**Cons**: Need to use a font where Braille is supported. It's a scatterplot without lines.
===== GnuPlot =====
# For ASCII on one full screen
#set term dumb `tput cols` `tput lines`*9/10
# If you have gnuplot 6.0 and are using Deja Vu font, then...
#set term block braille size `tput cols`,`tput lines`*9/10
# For a PNG file.
set term png size 900,400; set output 'AirPassengers.png'
set timefmt '%Y-%m'
set xdata time
set format x '%Y'
set key autotitle columnhead # use the first line for titles in legend
set xlabel 'Year' # except for X, where we show Year not Month
set datafile sep ','
# You can use: lines, points, linespoints
plot 'AirPassengers.csv' using 1:2 with lines
gnuplot AirPassengers.gpi && explorer.exe AirPassengers.png
{{:general:airpassengers.png?400|}}
When you change ''term'' to dumb, then depending on your terminal size you get output like:
(Text plot of #Passengers against Year: Y axis from 100 to 700, X axis from 1949 to 1961.)
**Pros**: Fast and easy. Render to text or png pretty easily. Sometimes better text renderings than VisiData.
**Cons**: Not that pretty without customizations. GPI file takes some tweaking.
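The ''set timefmt '%Y-%m''' and ''set format x '%Y''' lines in the script above use strftime-style codes: the first describes how to read the input, the second how to print the axis labels. Python's ''datetime'' uses similar codes, so the round trip can be sketched as below. This assumes the first CSV column looks like ''1949-01''. GnuPlot's specifiers are close to Python's but not guaranteed to be identical.

```python
from datetime import datetime

# Assumed sample value from the first column of AirPassengers.csv
raw = "1949-01"

# Parse with the same pattern as 'set timefmt'; the missing day defaults to 1
month = datetime.strptime(raw, "%Y-%m")

# Format with the same pattern as 'set format x'
label = month.strftime("%Y")

print(month.isoformat(), label)
```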
===== Matplotlib =====
import matplotlib.pyplot as plt
import pandas as pd

# Load the CSV and index it by real timestamps so the X axis is a time axis
data = pd.read_csv('AirPassengers.csv')
data['Month'] = pd.to_datetime(data['Month'])
data = data.set_index('Month')

plt.figure(figsize=(10, 5))
plt.xlabel("Year")
plt.ylabel("Airline Passengers")
plt.plot(data)
plt.show()
{{:general:matplotlib_airpassengers.png?400|}}
**Pros**: There's [[https://github.com/Manishms18/Air-Passengers-Time-Series-Analysis/blob/master/Air_Passenger_with_explanations.ipynb|so much more you can do]].
**Cons**: Heavyweight.