Data Analysis
Console and old-school Tools
VisiData
VisiData can extract tabular data from webpages and is the best console-mode CSV data tool around.
- Metrics for columns: Use I to quickly get metrics for numeric columns.
- Frequency Chart: F
- Histograms: O to set “numeric binning”, then use F
- Scatter Plot: .
Key | Meaning |
---|---|
I | Describe all columns, errors, distinct, mode, mean, median, stdev, etc. |
i | Add a column of incrementing numbers (useful for '.' charts) |
. | Requires an “important” numeric column for row data. Make chart. |
O | Options to enable “numeric_binning” and set number of “histogram_bins” (use 'e') |
F | Frequency table of row counts, or histogram if numeric_binning is true |
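To make the difference between the two F modes concrete, here is a rough Python sketch of a frequency table versus numeric binning. It is not VisiData's implementation: the function names are made up, the equal-width binning scheme is an assumption, and VisiData's actual bin edges may differ.

```python
from collections import Counter

def frequency_table(values):
    """Count how many rows hold each distinct value, most common first."""
    return Counter(values).most_common()

def binned_counts(values, bins):
    """Count numeric values in `bins` equal-width intervals (an assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

# Hypothetical categorical column: one row per distinct value and its count.
print(frequency_table(["red", "blue", "red"]))

# Hypothetical numeric column, grouped into two bins instead of five distinct rows.
print(binned_counts([5, 4, 6, 8, 7], bins=2))
```

With distinct numeric values a plain frequency table has one row per value, which is why binning is usually the more readable view for numeric columns.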
GnuPlot
GnuPlot is fine for generating text or image plots. Scatter plots, charts, and histograms are easy.
Here's an example command, given the following two files, data.csv and gnuplot_instructions.gpi:
gnuplot -e "f='data.csv'; t='Rating'" gnuplot_instructions.gpi
- data.csv
    1992-01-01,5,14
    1992-02-01,4,15
    1992-03-01,6,16
    1992-04-01,8,15.5
    1992-05-01,7,16
And the plotting instructions:
- gnuplot_instructions.gpi
    #
    # For ASCII on one full screen
    #
    set term dumb `tput cols` `tput lines`*9/10
    #
    # For a PNG file.
    #
    # Launch in WSL: explorer.exe gnuplot.png
    # Launch in Linux: gio open gnuplot.png
    #
    #set term png size 1600,800; set output 'gnuplot.png'
    #
    # Set your X axis format
    #
    set timefmt '%Y-%m-%d'
    #set timefmt '%Y-%m-%d %H:%M'
    set xdata time
    set xlabel 'Date'
    set ylabel 'Value'
    #
    # If you need to control tics or range
    #
    #set ytic 2
    #set yrange [0:9]
    #
    # sep is whitespace by default. Set to ',' for csv
    #
    set datafile sep ','
    #
    # Plot which columns which ways.
    #
    # lines, points, linespoints
    #
    #plot f using 1:4 with lines, f using 1:3 with linespoints
    #plot f using 1:2 with lines title t, f using 1:3 with linespoints title 'Legend 2'
    plot f using 1:2 with linespoints title t
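For readers more used to Python, the column selection can be sketched as follows. In the instructions, `using 1:2` takes the first column as X and the second as Y, and `set timefmt` says how to parse the date text. The helper name `select_columns` is made up for this illustration. It also assumes that `%Y-%m-%d` means the same thing to Python's `strptime` as it does to gnuplot, which holds for these common specifiers but not necessarily for every format code.

```python
import csv
import io
from datetime import datetime

# Sample rows copied from the data.csv example.
DATA_CSV = """\
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16
"""

def select_columns(text, x_col, y_col, timefmt="%Y-%m-%d"):
    """Return (datetime, float) pairs, using gnuplot-style 1-based column numbers."""
    points = []
    for row in csv.reader(io.StringIO(text)):
        x = datetime.strptime(row[x_col - 1], timefmt)  # like 'set timefmt'
        y = float(row[y_col - 1])
        points.append((x, y))
    return points

# Equivalent in spirit to: plot f using 1:2
for when, value in select_columns(DATA_CSV, 1, 2):
    print(when.date(), value)
```

Changing the last argument pair to `1, 3` corresponds to `using 1:3` in the commented-out plot lines.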
Python Tools
There's a hierarchy to the Python data analysis tools. You definitely need the most nested tools, but you may not need the IDE and packager at the root. It goes something like this:
- Anaconda Distribution: (Bigger snake than Python) packages Jupyter and everything you'd need.
- Jupyter: The interactive notebook with embedded images.
- Pandas: Read and sanitize data with read_csv(), to_datetime()
- NumPy: Fundamental, the other projects rely on it.
- Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
Anaconda Distribution
Optional. I think it's called Anaconda because that's literally a bigger snake than Python. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. There's a cloud-based trial with a Jupyter notebook.
Anaconda Navigator lets you launch Jupyter, R, VS Code, etc.
Jupyter
Optional. Jupyter Notebook is open-source software and services for interactive computing across multiple programming languages. Interactive embedded charts as you work seem nice. Maybe you can try the online version.
Pandas
Get this. Read and sanitize data with read_csv(), to_datetime(). Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Matplotlib works well with Pandas DataFrames (two-dimensional tabular data).
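As a minimal sketch of the read-and-sanitize step, the snippet below loads the data.csv sample from the GnuPlot section with read_csv() and converts the first column with to_datetime(). The column names are made up for illustration, since the sample file has no header row, and reading from an in-memory string instead of a file is only there to keep the example self-contained.

```python
import io
import pandas as pd

# Sample rows from the data.csv example on this page (no header row).
raw = io.StringIO(
    "1992-01-01,5,14\n"
    "1992-02-01,4,15\n"
    "1992-03-01,6,16\n"
    "1992-04-01,8,15.5\n"
    "1992-05-01,7,16\n"
)

# Column names are hypothetical; the file itself has none.
df = pd.read_csv(raw, header=None, names=["date", "rating", "other"])

# Convert the text dates into real timestamps. errors="coerce" turns
# unparseable values into NaT so they can be dropped afterwards.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
df = df.dropna(subset=["date"]).set_index("date")

print(df.dtypes)      # column types after loading
print(df.describe())  # count, mean, std, min, quartiles, max per column
```

The describe() call gives roughly the same kind of per-column summary as VisiData's I key.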
NumPy
Get this. NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, and random simulation.
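A few of those array routines, sketched on the numeric columns of the data.csv sample from the GnuPlot section. This is only a small illustration of the statistics, sorting, and selecting operations mentioned above, not a tour of the library.

```python
import numpy as np

# Second and third columns of the data.csv sample from this page.
values = np.array([[5, 14], [4, 15], [6, 16], [8, 15.5], [7, 16]])

print(values.shape)                # (rows, columns)
print(values.mean(axis=0))         # per-column mean
print(np.median(values, axis=0))   # per-column median
print(values.std(axis=0))          # population standard deviation (ddof=0)
print(np.sort(values[:, 0]))       # first column, ascending
print(values[values[:, 0] > 5])    # rows whose first column exceeds 5
```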
Matplotlib
Get this. Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
Case Study: Temporal Series
Data: AirPassengers.csv
VisiData
vd AirPassengers.csv
Key | Action |
---|---|
@ | Set Column one as date format |
! | Set Column one as “important” for the X-axis |
l | Navigate to column 2 |
# | Set data as integer format |
. | Create scatterplot |
Pros: Super fast and easy.
Cons: Need to use a font where Braille is supported. It's a scatterplot without lines.
GnuPlot
- AirPassengers.gpi
    # For ASCII on one full screen
    #set term dumb `tput cols` `tput lines`*9/10
    # For a PNG file.
    set term png size 900,400; set output 'AirPassengers.png'
    set timefmt '%Y-%m'
    set xdata time
    set format x '%Y'
    set xlabel 'Year'
    set ylabel 'Passengers'
    set datafile sep ','
    # You can use: lines, points, linespoints
    plot 'AirPassengers.csv' using 1:2 with lines title 'Airline Passengers'
gnuplot AirPassengers.gpi && explorer.exe AirPassengers.png
When you change term to dumb, then depending on your terminal size, you get output like:
[ASCII plot from the dumb terminal: 'Airline Passengers' drawn with * characters; Y axis from 100 to 700, X axis 'Year' from 1949 to 1961]
Pros: Fast and easy. Render to text or png pretty easily. Sometimes better text renderings than VisiData.
Cons: Not that pretty without customizations. GPI file takes some tweaking.
Matplotlib
    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the CSV and parse the 'Month' column (e.g. '1949-01') as dates
    data = pd.read_csv('AirPassengers.csv')
    data['Month'] = pd.to_datetime(data['Month'])
    data = data.set_index('Month')

    # Plot the passenger counts against the date index
    plt.figure(figsize=(10, 5))
    plt.xlabel("Year")
    plt.ylabel("Airline Passengers")
    plt.plot(data)
    plt.show()
Pros: There's so much more you can do.
Cons: Heavyweight.
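Since the GnuPlot version renders to a PNG file, a comparable Matplotlib sketch can save an image instead of opening a window, which is handy on a machine without a display. The function name `plot_to_png` is made up for this example, and it assumes the CSV has a header row with a 'Month' column, as the script above does.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to files, no window
import matplotlib.pyplot as plt
import pandas as pd

def plot_to_png(csv_path, png_path):
    """Plot a CSV with a 'Month' column and save the chart as a PNG."""
    data = pd.read_csv(csv_path)
    data["Month"] = pd.to_datetime(data["Month"])
    data = data.set_index("Month")

    # 9x4 inches at dpi=100 is roughly the 900x400 used in the GnuPlot example.
    fig, ax = plt.subplots(figsize=(9, 4))
    ax.plot(data)
    ax.set_xlabel("Year")
    ax.set_ylabel("Airline Passengers")
    fig.savefig(png_path, dpi=100)
    plt.close(fig)  # free the figure's memory once it is written
```

Calling `plot_to_png('AirPassengers.csv', 'AirPassengers.png')` should then produce a file that can be opened the same way as the GnuPlot output.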