
Data Analysis

Console and old-school Tools

VisiData

VisiData can extract tabular data from webpages and is the best console-mode CSV data tool around.

  • Metrics for columns: Use I to quickly get metrics for numeric columns.
  • Frequency Chart: F
  • Histograms: O to enable “numeric_binning”, then use F
  • Scatter Plot: .

Key   Meaning
I     Describe all columns: errors, distinct, mode, mean, median, stdev, etc.
i     Add a column of incrementing numbers (useful for '.' charts)
.     Make a chart. Requires an “important” numeric column for row data.
O     Options to enable “numeric_binning” and set the number of “histogram_bins” (use 'e' to edit)
F     Frequency table of row counts, or a histogram if numeric_binning is true
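
To make the I and F sheets less magical, here's a rough standard-library sketch of the same kinds of numbers, run on the third column of the sample data.csv rows from the GnuPlot section. The function names (describe, frequency_table, binned_counts) are made up for this sketch, and the equal-width binning is only an assumption about what a histogram does in general, not a copy of VisiData's own algorithm.

```python
import statistics
from collections import Counter

# Third column of the sample data.csv rows from the GnuPlot section.
values = [14.0, 15.0, 16.0, 15.5, 16.0]


def describe(values):
    """Summary metrics similar in spirit to a 'describe' sheet."""
    return {
        "distinct": len(set(values)),
        "mode": statistics.mode(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation; other tools may use the population form.
        "stdev": statistics.stdev(values),
    }


def frequency_table(values):
    """(value, row count) pairs, most frequent first."""
    return Counter(values).most_common()


def binned_counts(values, bins):
    """Row counts in `bins` equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 if every value is equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


print(describe(values))
print(frequency_table(values))
print(binned_counts(values, bins=2))
```

The frequency table counts exact values, while the binned version groups nearby values together, which is the difference numeric_binning makes to F.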

GnuPlot

GnuPlot is fine for generating text or image plots. Scatter plots, line charts, and histograms are easy.

Here's an example command, given the two files data.csv and gnuplot_instructions.gpi shown after it:

gnuplot -e "f='data.csv'; t='Rating'" gnuplot_instructions.gpi
data.csv
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16

And the instructions file that the command above runs:

gnuplot_instructions.gpi
#
# For ASCII on one full screen
#
set term dumb `tput cols` `tput lines`*9/10
 
#
# For a PNG file.
#
# Launch in WSL: explorer.exe gnuplot.png
# Launch in Linux: gio open gnuplot.png
#
#set term png size 1600,800; set output 'gnuplot.png'
 
#
# Set your X axis format
#
set timefmt '%Y-%m-%d'
#set timefmt '%Y-%m-%d %H:%M'
set xdata time
set xlabel 'Date'
set ylabel 'Value'
 
#
# If you need to control tics or range
#
#set ytic 2
#set yrange [0:9]
 
#
# sep is whitespace by default. Set to ',' for csv
#
set datafile sep ','
 
#
# Plot which columns which ways.
#
# lines, points, linespoints
#
#plot f using 1:4 with lines, f using 1:3 with linespoints
#plot f using 1:2 with lines title t, f using 1:3 with linespoints title 'Legend 2'
plot f using 1:2 with linespoints title t

Python Tools

There's a hierarchy to the Python data analysis tools. You definitely need the most deeply nested tools, but you may not need the IDE and packager at the root. It goes something like this:

  • Anaconda Distribution: (Bigger snake than Python) packages Jupyter and everything you'd need.
    • Jupyter: The interactive notebook with embedded images.
      • Pandas: Read and sanitize data with read_csv(), to_datetime()
        • NumPy: Fundamental, the other projects rely on it.
        • Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.

Here's an example analysis of AirPassengers over time.

Anaconda Distribution

No get. I think it's called Anaconda because that's literally a bigger snake than Python. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. There's a cloud-based trial with a Jupyter notebook.

Anaconda Navigator lets you launch Jupyter, R, VS Code, etc.

Jupyter

No get. Jupyter Notebook is open-source software and services for interactive computing across multiple programming languages. Interactive embedded charts as you work seem nice. Maybe you can try the online version.

Pandas

Get this. Read and sanitize data with read_csv(), to_datetime(). Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Matplotlib works well with Pandas DataFrames (two-dimensional tabular data).
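
As a small sketch of that read-and-sanitize step, here are the sample rows from data.csv in the GnuPlot section run through read_csv() and to_datetime(). The file has no header row, so the column names ("date", "rating", "other") are invented for this example, and the text is read from a string so the snippet runs without any files on disk.

```python
import io

import pandas as pd

# The sample data.csv rows from the GnuPlot section. There is no header row,
# so the column names passed to read_csv() below are made up for this sketch.
csv_text = """1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["date", "rating", "other"],
)

# Turn the date strings into real timestamps and use them as the index.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.set_index("date")

print(df.dtypes)
print(df["rating"].mean())
```

With the dates as the index, the DataFrame is in the shape that plotting tools usually want for a time series.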

NumPy

Get this. NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, and random simulation.
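
A few of those array routines, sketched on the second column of the sample data.csv rows from the GnuPlot section. This is only a taste of the statistical, sorting, selecting, and binning calls, and the variable names are just for illustration.

```python
import numpy as np

# Second column of the sample data.csv rows from the GnuPlot section.
ratings = np.array([5, 4, 6, 8, 7])

# Basic statistics over the whole array.
mean = ratings.mean()
sample_std = ratings.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Sorting returns a new array and leaves the original untouched.
ordered = np.sort(ratings)

# Selecting with a boolean mask: keep only the values above the mean.
above_mean = ratings[ratings > mean]

# Equal-width histogram: counts per bin plus the bin edges.
counts, edges = np.histogram(ratings, bins=2)

print(mean, sample_std)
print(ordered)
print(above_mean)
print(counts, edges)
```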

Matplotlib

Get this. Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
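
Here's a minimal sketch that renders straight to a PNG file with no window, much like the PNG option in the GnuPlot instructions. It plots the first two columns of the sample data.csv rows from the GnuPlot section. The output filename matplotlib_sketch.png is arbitrary, and choosing the "Agg" backend is an assumption that you want file output only.

```python
import datetime as dt

import matplotlib

matplotlib.use("Agg")  # file-only backend, so no display is needed
import matplotlib.pyplot as plt

# First two columns of the sample data.csv rows from the GnuPlot section.
dates = [dt.date(1992, month, 1) for month in range(1, 6)]
ratings = [5, 4, 6, 8, 7]

fig, ax = plt.subplots(figsize=(8, 4))
(line,) = ax.plot(dates, ratings, marker="o", label="Rating")
ax.set_xlabel("Date")
ax.set_ylabel("Value")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they don't overlap

# Arbitrary output name chosen for this sketch.
fig.savefig("matplotlib_sketch.png")
```

Open the resulting file the same way as the GnuPlot output, for example with explorer.exe in WSL or gio open in Linux.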

Case Study: Temporal Series

VisiData

vd AirPassengers.csv

Key   Action
@     Set column one as date format
!     Set column one as “important” for the X-axis
l     Navigate to column two
#     Set data as integer format
.     Create scatterplot

Pros: Super fast and easy.
Cons: Need to use a font where Braille is supported. It's a scatterplot without lines.

GnuPlot

AirPassengers.gpi
# For ASCII on one full screen
#set term dumb `tput cols` `tput lines`*9/10
 
# For a PNG file.
set term png size 900,400; set output 'AirPassengers.png'
 
set timefmt '%Y-%m'
set xdata time
set format x '%Y'
set xlabel 'Year'
set ylabel 'Passengers'
set datafile sep ','
 
# You can use: lines, points, linespoints
plot 'AirPassengers.csv' using 1:2 with lines title 'Airline Passengers'

Run it and open the image (from WSL) with:

gnuplot AirPassengers.gpi && explorer.exe AirPassengers.png

Pros: Fast and easy. Render to text or png pretty easily.
Cons: Not that pretty. GPI file takes some tweaking.

Matplotlib

import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV and index it by the parsed 'Month' column
data = pd.read_csv('AirPassengers.csv')
data['Month'] = pd.to_datetime(data['Month'])
data = data.set_index('Month')

# Plot the remaining column(s) against the date index
plt.figure(figsize=(10,5))
plt.xlabel("Year")
plt.ylabel("Airline Passengers")
plt.plot(data)
plt.show()

Pros: There's so much more you can do.
Cons: Heavyweight.

data-analysis.1681586012.txt.gz · Last modified: 2023/04/15 12:13 by dblume