
Data Analysis

Console and old-school Tools

VisiData

VisiData can extract tabular data from webpages and is the best console-mode CSV data tool around.

  • Metrics for columns: Use I to quickly get metrics for numeric columns.
  • Frequency Chart: F
  • Histograms: O to set “numeric binning”, then use F
  • Scatter Plot: .

Key  Meaning
I    Describe all columns: errors, distinct, mode, mean, median, stdev, etc.
i    Add a column of incrementing numbers (useful for '.' charts)
.    Make a chart. Requires an “important” numeric column for row data.
O    Options: enable “numeric_binning” and set the number of “histogram_bins” (use 'e' to edit)
F    Frequency table of row counts, or a histogram if numeric_binning is true
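
To make the difference between a frequency table and a histogram concrete, here is a minimal Python sketch of the two ideas. It only illustrates the concept and is not how VisiData implements F; the function names and the sample values are made up for this example.

```python
from collections import Counter


def frequency_table(values):
    """Count how many rows share each distinct value, most common first."""
    return Counter(values).most_common()


def binned_histogram(values, bins):
    """Count rows per equal-width numeric bin spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value sits on the last bin's upper edge, so clamp it in.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


ratings = [5, 4, 6, 8, 7, 5, 5]  # made-up sample values

# Without binning: one row per distinct value.
print(frequency_table(ratings))

# With binning: one row per numeric range.
for low, high, count in binned_histogram(ratings, bins=2):
    print(f"{low:g} to {high:g}: {count}")
```

With binning off, every distinct value gets its own row; with binning on, nearby values are grouped into ranges, which is what makes the result read as a histogram.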

GnuPlot

GnuPlot is fine for generating text or image plots. Scatter plots, charts, and histograms are easy.

Here's an example command, given the following two files, data.csv and gnuplot_instructions.gpi:

gnuplot -e "f='data.csv'; t='Rating'" gnuplot_instructions.gpi
data.csv
1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16

And the plotting instructions, which the command above runs:

gnuplot_instructions.gpi
#
# For ASCII on one full screen
#
set term dumb `tput cols` `tput lines`*9/10
 
#
# For a PNG file.
#
# Launch in WSL: explorer.exe gnuplot.png
# Launch in Linux: gio open gnuplot.png
#
#set term png size 1600,800; set output 'gnuplot.png'
 
#
# Set your X axis format
#
set timefmt '%Y-%m-%d'
#set timefmt '%Y-%m-%d %H:%M'
set xdata time
set xlabel 'Date'
set ylabel 'Value'
 
#
# If you need to control tics or range
#
#set ytic 2
#set yrange [0:9]
 
#
# sep is whitespace by default. Set to ',' for csv
#
set datafile sep ','
 
#
# Plot which columns which ways.
#
# lines, points, linespoints
#
#plot f using 1:4 with lines, f using 1:3 with linespoints
#plot f using 1:2 with lines title t, f using 1:3 with linespoints title 'Legend 2'
plot f using 1:2 with linespoints title t
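
If you plot several files, it may help to build the gnuplot command line from a script. The sketch below assumes the gnuplot_instructions.gpi file above and relies on gnuplot's -e option, which runs the given assignments before the script so that f and t are defined when the plot line uses them. The helper names gnuplot_command and run_gnuplot are made up for this example.

```python
import shutil
import subprocess


def gnuplot_command(data_file, title, script="gnuplot_instructions.gpi"):
    """Build the argument list for the gnuplot invocation."""
    # -e runs these assignments first, so the script can refer to f and t.
    assignments = f"f='{data_file}'; t='{title}'"
    return ["gnuplot", "-e", assignments, script]


def run_gnuplot(data_file, title):
    """Run gnuplot if it is installed; raise if it is missing or fails."""
    if shutil.which("gnuplot") is None:
        raise RuntimeError("gnuplot is not installed or not on PATH")
    subprocess.run(gnuplot_command(data_file, title), check=True)


# Show the command without running it.
print(gnuplot_command("data.csv", "Rating"))
```

Call run_gnuplot("data.csv", "Rating") from the directory that holds both files to produce the plot.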

Python Tools

There's a hierarchy to the Python data analysis tools. You definitely need the most deeply nested tools; you may not need the IDE and packager at the root. It goes something like this:

  • Anaconda Distribution: (Bigger snake than Python) packages Jupyter and everything you'd need.
    • Jupyter: The interactive notebook with embedded images.
      • Pandas: Read and sanitize data with read_csv(), to_datetime()
        • NumPy: Fundamental, the other projects rely on it.
        • Matplotlib: A plotting library for Python and its numerical mathematics extension, NumPy.

Here's an example analysis of AirPassengers over time.

Anaconda Distribution

No get. I think it's called Anaconda because that's literally a bigger snake than Python. Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. There's a cloud-based trial with a Jupyter notebook.

Anaconda Navigator lets you launch Jupyter, R, VS Code, etc.

Jupyter

No get. Jupyter Notebook is open-source software and services for interactive computing across multiple programming languages. Interactive embedded charts as you work seem nice. Maybe you can try the online version.

Pandas

Get this. Read and sanitize data with read_csv() and to_datetime(). Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Matplotlib works well with Pandas's DataFrames (two-dimensional tabular data).
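
As a rough sketch of the read-and-sanitize step, the example below loads the data.csv sample from the GnuPlot section with read_csv() and converts the first column with to_datetime(). The column names are assumptions, since the file has no header row, and the data is embedded as a string only so the example runs on its own.

```python
import io

import pandas as pd

# The data.csv sample from the GnuPlot section; it has no header row.
CSV_TEXT = """1992-01-01,5,14
1992-02-01,4,15
1992-03-01,6,16
1992-04-01,8,15.5
1992-05-01,7,16
"""

# Column names are assumptions; the original file does not name its columns.
df = pd.read_csv(io.StringIO(CSV_TEXT), header=None, names=["date", "rating", "other"])

# Convert text dates to timestamps; errors="coerce" turns unparseable values into NaT.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Drop rows whose date could not be parsed, then index the table by date.
df = df.dropna(subset=["date"]).set_index("date")

print(df.dtypes)
print(df.describe())
```

For a real file, pass the path (for example "data.csv") to read_csv() instead of the StringIO object.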

NumPy

Get this. NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, and random simulation.
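
A small sketch of a few of those array routines, using the second column of the data.csv sample from the GnuPlot section as the input. The variable names are made up for this example.

```python
import numpy as np

# Second column of the data.csv sample.
ratings = np.array([5, 4, 6, 8, 7])

# Basic statistics operate on the whole array at once.
mean = ratings.mean()
median = np.median(ratings)
stdev = ratings.std()  # population standard deviation (ddof=0)
print(mean, median, stdev)

# Logical selection: a boolean mask picks the values above the mean.
above_mean = ratings[ratings > mean]
print(above_mean)

# Sorting returns a new array and leaves the original untouched.
print(np.sort(ratings))
```

Note that ratings.std() defaults to the population standard deviation; pass ddof=1 for the sample standard deviation.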

Matplotlib

Get this. Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension, NumPy.
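
For comparison with the GnuPlot example, here is a minimal sketch that draws the same date and rating values as a line-with-points chart and saves it as a PNG. The output filename is made up, and the Agg backend is selected only so the script runs without a display.

```python
from datetime import date

import matplotlib

matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

# First two columns of the data.csv sample from the GnuPlot section.
dates = [date(1992, month, 1) for month in range(1, 6)]
ratings = [5, 4, 6, 8, 7]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(dates, ratings, marker="o", label="Rating")  # like gnuplot's linespoints
ax.set_xlabel("Date")
ax.set_ylabel("Value")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they don't overlap

# Hypothetical output name; change it to suit.
fig.savefig("matplotlib_example.png")
```

In a Jupyter notebook you would normally skip the backend line and the savefig() call, since the chart is shown inline.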

data-analysis.1681544624.txt.gz · Last modified: 2023/04/15 00:43 by dblume