Data Analysis
Console and old-school Tools
VisiData
VisiData can extract tabular data from webpages and is the best console-mode CSV data tool around.
- Metrics for columns: Use I to quickly get metrics for numeric columns.
- Frequency Chart: F
- Histograms: O to set “numeric binning”, then use F
- Scatter Plot: '.'
Key | Meaning |
---|---|
I | Describe all columns, errors, distinct, mode, mean, median, stdev, etc. |
i | Add a column of incrementing numbers (useful for '.' charts) |
. | Requires an “important” numeric column for row data. Make chart. |
O | Options to enable “numeric_binning” and set number of “histogram_bins” (use 'e') |
F | Frequency table of row counts, or histogram if numeric_binning is true |
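For anyone who wants the same numbers outside VisiData, here is a rough sketch of what I (describe) and F (frequency table, or histogram with numeric binning) report, using only the Python standard library. The sample values, the `bin_counts` helper, and the choice of equal-width bins are illustrative assumptions, not VisiData's own implementation.

```python
from collections import Counter
import statistics

# Sample column values (made up for illustration).
values = [5, 4, 6, 8, 7, 4]

# Roughly what 'I' reports for a numeric column.
summary = {
    "distinct": len(set(values)),
    "mode": statistics.mode(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# Roughly what 'F' reports: a count of rows per distinct value.
frequency = Counter(values)


def bin_counts(data, bins):
    """Count values in equal-width bins (hypothetical helper)."""
    low, high = min(data), max(data)
    width = (high - low) / bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [len(data)] + [0] * (bins - 1)
    counts = [0] * bins
    for value in data:
        # The maximum value belongs in the last bin, not one past it.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


histogram = bin_counts(values, bins=2)
print(summary)
print(frequency.most_common())
print(histogram)
```

The frequency table counts rows per exact value, while the binned version groups nearby values together, which is the difference the “numeric_binning” option toggles.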
GnuPlot
GnuPlot is fine for generating text or image plots. Scatter plots and charts and histograms are easy.
Here's an example command, given the following two files, data.csv and gnuplot_instructions.gpi:
gnuplot -e "f='data.csv'" gnuplot_instructions.gpi
- data.csv
    date,col1,col2
    1992-01-01,5,14
    1992-02-01,4,15
    1992-03-01,6,16
    1992-04-01,8,15.5
    1992-05-01,7,16
And the instructions file that the command passes to gnuplot:
- gnuplot_instructions.gpi
    #
    # For ASCII on one full screen
    #
    set term dumb `tput cols` `tput lines`*9/10
    # Or, if you have gnuplot 6.0 and are using Deja Vu font, then...
    #set term block braille size `tput cols`,`tput lines`*9/10
    #
    # For a PNG file.
    #
    # Launch in WSL: explorer.exe gnuplot.png
    # Launch in Linux: gio open gnuplot.png
    #
    #set term png size 1600,800; set output 'gnuplot.png'
    #
    # Set your X axis format
    #
    set timefmt '%Y-%m-%d'
    #set timefmt '%Y-%m-%d %H:%M'
    set xdata time
    set xlabel 'Date'
    set xtics "1992-01-01", 2629746 # start, increment in seconds
    #set ylabel 'Value'
    #
    # If you need to control tics or range
    #
    #set ytic 2
    #set yrange [0:9]
    #
    # sep is whitespace by default. Set to ',' for csv
    #
    set datafile sep ','
    set key autotitle columnhead # use the first line as title
    firstrow = system('head -1 '.f. ' | tr "_," "- "')
    set xlabel word(firstrow, 1)
    set ylabel word(firstrow, 2)
    #
    # Plot which columns which ways.
    #
    # lines, points, linespoints
    #
    #plot f using 1:4 with lines, f using 1:3 with linespoints
    #plot f using 1:2 with lines, f using 1:3 with linespoints title 'Legend 2'
    plot f using 1:2 with linespoints
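Two lines of that script are easier to follow with the steps spelled out. The sketch below mimics, in Python, what the `head -1 ... | tr "_," "- "` pipeline plus `word(firstrow, n)` produce for the sample header, and shows where the xtics increment of 2629746 seconds comes from (one twelfth of an average Gregorian year). The function name `axis_labels` is hypothetical, and this is an approximation of the shell and gnuplot behaviour rather than a replacement for it.

```python
def axis_labels(header_line):
    # Same substitution as tr "_," "- ": underscore becomes hyphen,
    # comma becomes space. Splitting on whitespace then plays the
    # role of gnuplot's word() function.
    translated = header_line.translate(str.maketrans("_,", "- "))
    return translated.split()


labels = axis_labels("date,col1,col2")
print(labels[0])  # what word(firstrow, 1) feeds to xlabel
print(labels[1])  # what word(firstrow, 2) feeds to ylabel

# Average Gregorian year: 365.2425 days of 86400 seconds, split into 12 months.
seconds_per_year = 365.2425 * 86400
seconds_per_month = round(seconds_per_year / 12)
print(seconds_per_month)
```

One consequence of splitting on spaces: a column title that itself contains a space would be broken into several words, so the labels would shift.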
If you're making a “histogram” (actually a box chart with histogram style on X,Y points)…
- gnuplot_instructions.gpi
    # Mostly the same as above, until...
    # Set your X axis format
    set style histogram clustered gap 1
    set style fill solid border -1
    # Finally, plot with boxes
    plot f using 1:2 with boxes
Python Tools
There's a hierarchy to the Python data analysis tools. You definitely need the innermost tools, but you may not need the IDE and packager at the top. It goes something like this:
- Anaconda Distribution: (Bigger snake than Python) packages Jupyter and everything you'd need.
- Jupyter: The interactive notebook with embedded images.
- Pandas: Read and sanitize data with read_csv(), to_datetime()
- NumPy: Fundamental, the other projects rely on it.
- Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
- Plotly: Generates interactive Javascript plots
Anaconda Distribution
No get. I think it's called Anaconda because that's literally a bigger snake than Python. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. There's a cloud-based trial with a Jupyter notebook.
Anaconda Navigator lets you launch Jupyter, R, VS Code, etc.
Jupyter
No get. Jupyter Notebook is open-source software and services for interactive computing across multiple programming languages. Interactive embedded charts as you work seem nice. Maybe you can try the online version.
Pandas
Get this. Read and sanitize data with read_csv(), to_datetime(). Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Matplotlib works well with Pandas DataFrames (two-dimensional tabular data).
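As a minimal sketch of that read-and-sanitize step, the snippet below loads the sample data.csv rows from the GnuPlot section and converts the date column with to_datetime(). The rows are inlined through `io.StringIO` only so the example runs without a file on disk; with a real file you would pass its path to read_csv() instead.

```python
import io

import pandas as pd

# Same rows as the data.csv sample, inlined for a self-contained run.
csv_text = """date,col1,col2
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16
"""

data = pd.read_csv(io.StringIO(csv_text))

# read_csv leaves the dates as strings here; convert them explicitly.
data["date"] = pd.to_datetime(data["date"], format="%Y-%m-%d")

# A datetime index lets plotting code place the X axis by date.
data = data.set_index("date")

print(data.dtypes)
print(data.loc[pd.Timestamp("1992-03-01"), "col1"])
```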
NumPy
Get this. NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, and random simulation.
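A small sketch of a few of the array operations listed above, applied to the col1 values from the sample data.csv. Nothing here is specific to this page beyond those five numbers.

```python
import numpy as np

# col1 values from the sample data.csv
col1 = np.array([5, 4, 6, 8, 7])

print(col1.mean())               # basic statistics
print(np.sort(col1))             # sorting
print(col1[col1 > 5])            # selecting with a boolean mask
print(col1.reshape(5, 1).shape)  # shape manipulation
```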
Plotly
Undecided whether to use this. See “Matplotlib vs. Plotly: Let’s Decide Once and for All”. Need to experiment. plotly.py is an interactive, open-source, and JavaScript-based graphing library for Python. Built on top of plotly.js, plotly.py is a high-level, declarative charting library that includes over 30 chart types, including scientific charts, 3D graphs, statistical charts, SVG maps, and financial charts.
Matplotlib
Get this. Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
Case Study: Temporal Series
Data: AirPassengers.csv
VisiData
vd AirPassengers.csv
Key | Action |
---|---|
@ | Set Column one as date format |
! | Set Column one as “important” for the X-axis |
l | Navigate to column 2 |
# | Set data as integer format |
. | Create scatterplot |
Pros: Super fast and easy.
Cons: Need to use a font where Braille is supported. It's a scatterplot without lines.
GnuPlot
- AirPassengers.gpi
    # For ASCII on one full screen
    #set term dumb `tput cols` `tput lines`*9/10
    # If you have gnuplot 6.0 and are using Deja Vu font, then...
    #set term block braille size `tput cols`,`tput lines`*9/10
    # For a PNG file.
    set term png size 900,400; set output 'AirPassengers.png'
    set timefmt '%Y-%m'
    set xdata time
    set format x '%Y'
    set key autotitle columnhead # use the first line for titles in legend
    set xlabel 'Year' # except for X, where we show Year not Month
    set datafile sep ','
    # You can use: lines, points, linespoints
    plot 'AirPassengers.csv' using 1:2 with lines
gnuplot AirPassengers.gpi && explorer.exe AirPassengers.png
When you change term to dumb, then depending on your terminal size you get output like:
[ASCII line plot: #Passengers, 100 to 700 on the Y axis, against Year, 1949 to 1961 on the X axis]
Pros: Fast and easy. Render to text or png pretty easily. Sometimes better text renderings than VisiData.
Cons: Not that pretty without customizations. GPI file takes some tweaking.
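Part of the tweaking is getting the date codes right. The `set timefmt '%Y-%m'` and `set format x '%Y'` lines in AirPassengers.gpi use strftime-style codes, so the same codes can be checked quickly in Python. This sketch parses a month string of the shape the script expects and formats it the way the axis labels are shown; the sample value 1949-01 is an assumed example, not a row quoted from the file.

```python
from datetime import datetime

month_text = "1949-01"  # assumed example of a value in the first CSV column

# Input side: the equivalent of set timefmt '%Y-%m'.
parsed = datetime.strptime(month_text, "%Y-%m")

# Output side: the equivalent of set format x '%Y'.
tick_label = parsed.strftime("%Y")

print(parsed)
print(tick_label)
```

If strptime raises a ValueError for a sample value, the format string does not match the data, and the same string is unlikely to suit the file in gnuplot either.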
Matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd

    # Load the CSV and parse the Month column as dates
    data = pd.read_csv('AirPassengers.csv')
    data['Month'] = pd.to_datetime(data['Month'])
    data = data.set_index(['Month'])

    # Plot passengers over time
    plt.figure(figsize=(10,5))
    plt.xlabel("Year")
    plt.ylabel("Airline Passengers")
    plt.plot(data)
    plt.show()
Pros: There's so much more you can do.
Cons: Heavyweight.