Data Analysis

Console and old-school Tools

VisiData

VisiData can extract tabular data from webpages and is the best console-mode CSV data tool around.

  • Metrics for columns: Use I to quickly get metrics for numeric columns.
  • Frequency Chart: F
  • Histograms: O to set “numeric binning”, then use F
  • Scatter Plot: .
Key  Meaning
I    Describe all columns: errors, distinct, mode, mean, median, stdev, etc.
i    Add a column of incrementing numbers (useful for '.' charts)
.    Requires an “important” numeric column for row data. Make chart.
O    Options to enable “numeric_binning” and set number of “histogram_bins” (use 'e')
F    Frequency table of row counts, or histogram if numeric_binning is true
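
For comparison, here is a rough pandas sketch of what the I and F keys report. It is not how VisiData computes anything, just a similar summary done by hand. It inlines the small data.csv sample from the GnuPlot section, and the choice of 3 bins is an arbitrary assumption for illustration.

```python
import io

import pandas as pd

# The data.csv sample from the GnuPlot section, inlined so the sketch is self-contained.
CSV_TEXT = """date,col1,col2
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Roughly what 'I' shows: count, mean, std, min, quartiles and max per numeric column.
summary = df[["col1", "col2"]].describe()
print(summary)

# Roughly what 'F' shows: how many rows hold each distinct value.
frequency = df["col2"].value_counts()
print(frequency)

# Roughly 'O' then 'F' with numeric binning: row counts per equal-width bin.
# bins=3 is an arbitrary choice for this sketch.
binned = pd.cut(df["col1"], bins=3).value_counts().sort_index()
print(binned)
```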

GnuPlot

GnuPlot is fine for generating text or image plots. Scatter plots and charts and histograms are easy.

Here's an example command, given the following two files, data.csv and gnuplot_instructions.gpi:

gnuplot -e "f='data.csv'" gnuplot_instructions.gpi
data.csv
date,col1,col2
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16

And the instructions file that the command above runs:

gnuplot_instructions.gpi
#
# For ASCII on one full screen
#
set term dumb `tput cols` `tput lines`*9/10
# Or, if you have gnuplot 6.0 and are using Deja Vu font, then...
#set term block braille size `tput cols`,`tput lines`*9/10
 
#
# For a PNG file.
#
# Launch in WSL: explorer.exe gnuplot.png
# Launch in Linux: gio open gnuplot.png
#
#set term png size 1600,800; set output 'gnuplot.png'
 
#
# Set your X axis format
#
set timefmt '%Y-%m-%d'
#set timefmt '%Y-%m-%d %H:%M'
set xdata time
set xlabel 'Date'
set xtics "1992-01-01", 2629746  # start, increment in seconds
#set ylabel 'Value'
 
#
# If you need to control tics or range
#
#set ytic 2
#set yrange [0:9]
 
#
# sep is whitespace by default. Set to ',' for csv
#
set datafile sep ','
set key autotitle columnhead  # use the first line as title
firstrow = system('head -1 '.f. ' | tr "_," "-  "')
set xlabel word(firstrow, 1)
set ylabel word(firstrow, 2)
 
#
# Plot which columns which ways.
#
# lines, points, linespoints
#
#plot f using 1:4 with lines, f using 1:3 with linespoints
#plot f using 1:2 with lines, f using 1:3 with linespoints title 'Legend 2'
plot f using 1:2 with linespoints
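
The xtics increment of 2629746 in the instructions above looks like the length of an average Gregorian month in seconds. That reading is an assumption, since the file only labels it "increment in seconds", but the arithmetic works out:

```python
# Seconds in an average Gregorian month, assuming a mean year of 365.2425 days.
SECONDS_PER_DAY = 24 * 60 * 60
MEAN_YEAR_DAYS = 365.2425

seconds_per_month = round(MEAN_YEAR_DAYS * SECONDS_PER_DAY / 12)
print(seconds_per_month)
```

Because real months are 28 to 31 days long, tics spaced this way drift a little from the first of each month over a long range.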

If you're making a “histogram” (actually a box chart with histogram style on X,Y points)…

gnuplot_instructions.gpi
# Mostly the same as above, until...
 
# Set the histogram and fill styles
set style histogram clustered gap 1
set style fill solid border -1
 
# Finally, plot with boxes
plot f using 1:2 with boxes
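
If you'd rather draw the same kind of box chart from Python, here is a minimal Matplotlib sketch. It inlines the data.csv sample from above. The output filename boxes.png and the bar width of 20 days are hypothetical choices for illustration, not anything from the gnuplot file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# The data.csv sample from above, inlined so the sketch is self-contained.
CSV_TEXT = """date,col1,col2
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["date"])

# Convert dates to Matplotlib's float day numbers so the bar width is in days.
x = mdates.date2num(df["date"].to_numpy())

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(x, df["col1"], width=20, edgecolor="black")  # one box per row
ax.xaxis_date()  # show the float day numbers as dates again
ax.set_xlabel("date")
ax.set_ylabel("col1")
fig.savefig("boxes.png")  # hypothetical output filename
```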

Python Tools

There's a hierarchy to the Python data analysis tools. You definitely need the most deeply nested tools, but you may not need the IDE and packager at the root. It goes something like this:

  • Anaconda Distribution: (Bigger snake than Python) packages Jupyter and everything you'd need.
    • Jupyter: The interactive notebook with embedded images.
      • Pandas: Read and sanitize data with read_csv(), to_datetime()
        • NumPy: Fundamental, the other projects rely on it.
        • Matplotlib: A plotting library for Python and its numerical mathematics extension NumPy.
    • Plotly: Generates interactive JavaScript plots

Here's an example analysis of AirPassengers over time.

Anaconda Distribution

Don't get this. I think it's called Anaconda because that's literally a bigger snake than Python. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. There's a cloud-based trial with a Jupyter notebook.

Anaconda Navigator lets you launch Jupyter, R, VS Code, etc.

Jupyter

Don't get this. Jupyter Notebook is open-source software and services for interactive computing across multiple programming languages. Interactive embedded charts as you work seem nice. Maybe try the online version.

Pandas

Get this. Read and sanitize data with read_csv(), to_datetime(). Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Matplotlib likes Pandas DataFrames (two-dimensional tabular data).
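
A minimal sketch of the read-and-sanitize step with read_csv() and to_datetime(). It inlines the data.csv sample from the GnuPlot section and adds one deliberately bad row (hypothetical, not in the original file) to show how unparseable dates can be dropped.

```python
import io

import pandas as pd

# data.csv from the GnuPlot section, plus one hypothetical bad row for illustration.
CSV_TEXT = """date,col1,col2
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
not-a-date,9,17
1992-04-01,8,15.5
1992-05-01,7,16
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# errors="coerce" turns unparseable dates into NaT instead of raising.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Drop the rows whose date could not be parsed, then index by date.
df = df.dropna(subset=["date"]).set_index("date")

print(df)
print(df.index.dtype)
```

Indexing by the parsed date column is the same move the Matplotlib case study below makes with set_index().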

NumPy

Get this. NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, and random simulation.
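
To make that list of routines a bit more concrete, here is a small sketch that touches a few of them. The two arrays are the col1 and col2 values from data.csv in the GnuPlot section. Which operations to show is just my pick.

```python
import numpy as np

# col1 and col2 from the data.csv sample in the GnuPlot section.
col1 = np.array([5, 4, 6, 8, 7])
col2 = np.array([14, 15, 16, 15.5, 16])

mean_col1 = col1.mean()                 # basic statistics
std_col1 = col1.std()                   # population standard deviation
sorted_col1 = np.sort(col1)             # sorting
above_mean = col1[col1 > mean_col1]     # logical selection with a boolean mask
table = np.column_stack([col1, col2])   # shape manipulation: two columns, five rows
spectrum = np.fft.rfft(col1)            # discrete Fourier transform of a real signal

print(mean_col1, std_col1)
print(sorted_col1, above_mean)
print(table.shape, len(spectrum))
```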

Plotly

Undecided whether to use this. See Matplotlib vs. Plotly: Let’s Decide Once and for All. Need to experiment. plotly.py is an interactive, open-source, and JavaScript-based graphing library for Python. Built on top of plotly.js, plotly.py is a high-level, declarative charting library that includes over 30 chart types, including scientific charts, 3D graphs, statistical charts, SVG maps, and financial charts.

Matplotlib

Get this. Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.

Case Study: Temporal Series

VisiData

vd AirPassengers.csv
Key  Action
@    Set column one as date format
!    Set column one as “important” for the X-axis
l    Navigate to column 2
#    Set data as integer format
.    Create scatterplot

Pros: Super fast and easy.
Cons: Need to use a font where Braille is supported. It's a scatterplot without lines.

GnuPlot

AirPassengers.gpi
# For ASCII on one full screen
#set term dumb `tput cols` `tput lines`*9/10
 
# If you have gnuplot 6.0 and are using Deja Vu font, then...
#set term block braille size `tput cols`,`tput lines`*9/10
 
# For a PNG file.
set term png size 900,400; set output 'AirPassengers.png'
 
set timefmt '%Y-%m'
set xdata time
set format x '%Y'
set key autotitle columnhead  # use the first line for titles in legend
set xlabel 'Year'             # except for X, where we show Year not Month
set datafile sep ','
 
# You can use: lines, points, linespoints
plot 'AirPassengers.csv' using 1:2 with lines

Run it (here from WSL, opening the PNG in Windows) with:

gnuplot AirPassengers.gpi && explorer.exe AirPassengers.png

When you change term to dumb, then depending on your terminal size you get output like:

700 +-------------------------------------------------------------------------------+
    |      +     +      +      +     +      +      +     +      +      +     +      |
    |                                                           #Passengers ******* |
    |                                                                            *  |
600 |-+                                                                          *+-|
    |                                                                      *    * * |
    |                                                                     **    * * |
500 |-+                                                             *     **    * *-|
    |                                                              **    * *    * * |
    |                                                        *     **    * *   *  * |
    |                                                       * *    **    *  ***    *|
400 |-+                                               **    * *    **   *   ***   +*|
    |                                          *     * *    * *  **  * **   *       |
    |                                          **    * *  *** *****  ***            |
    |                                          **   **  ***    * *    *             |
300 |-+                                  *     * ****   ***                       +-|
    |                             **    * *  **  ***    *                           |
    |                      **   ** *  *** ****    *                                 |
200 |-+              *     **  *    ***    *                                      +-|
    |         **  * * * ***  **     * *                                             |
    |  **   ** * ***  **                                                            |
    |*********  **      +      +     +      +      +     +      +      +     +      |
100 +-------------------------------------------------------------------------------+
  1949   1950  1951   1952   1953  1954   1955   1956  1957   1958   1959  1960   1961
                                          Year

Pros: Fast and easy. Render to text or png pretty easily. Sometimes better text renderings than VisiData.
Cons: Not that pretty without customizations. GPI file takes some tweaking.

Matplotlib

import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV and use the parsed Month column as a datetime index
data = pd.read_csv('AirPassengers.csv')
data['Month'] = pd.to_datetime(data['Month'])
data = data.set_index('Month')

# Plot passengers over time
plt.figure(figsize=(10, 5))
plt.xlabel("Year")
plt.ylabel("Airline Passengers")
plt.plot(data)
plt.show()

Pros: There's so much more you can do.
Cons: Heavyweight.
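
As one small taste of what else the Python stack can do, here is a sketch of a 12-month rolling mean and yearly totals with Pandas. The inlined rows are the first two years of the AirPassengers dataset as it is commonly distributed. I'm assuming they match the AirPassengers.csv used here, so check against your own file. The column names Month and #Passengers come from the code and plot legend above.

```python
import io

import pandas as pd

# First 24 rows of the commonly distributed AirPassengers dataset.
# Assumed to match AirPassengers.csv; verify against your own copy.
CSV_TEXT = """Month,#Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
1949-05,121
1949-06,135
1949-07,148
1949-08,148
1949-09,136
1949-10,119
1949-11,104
1949-12,118
1950-01,115
1950-02,126
1950-03,141
1950-04,135
1950-05,125
1950-06,149
1950-07,170
1950-08,170
1950-09,158
1950-10,133
1950-11,114
1950-12,140
"""

data = pd.read_csv(io.StringIO(CSV_TEXT))
data["Month"] = pd.to_datetime(data["Month"], format="%Y-%m")
data = data.set_index("Month")

# A 12-month rolling mean smooths out the seasonal swings.
# The first 11 values are NaN because the window is not yet full.
trend = data["#Passengers"].rolling(window=12).mean()

# Total passengers per calendar year.
yearly = data["#Passengers"].groupby(data.index.year).sum()

print(trend.dropna().head())
print(yearly)
```

With the full file you could pass trend to plt.plot() alongside the raw series to see the smoothed line on top of the seasonal one.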

data-analysis.txt · Last modified: 2024/05/06 22:22 by dblume