====== Data Analysis ======

===== Console and old-school Tools =====

==== VisiData ====

[[vd]] can extract tabular data from webpages and is the best console-mode CSV data tool around.

  * **Metrics for columns**: Use ''I'' to quickly get metrics for numeric columns.
  * **Frequency Chart**: ''F''
  * **Histograms**: ''O'' to set "numeric binning", then use ''F''
  * **Scatter Plot**: ''.''

^ Key ^ Meaning ^
| I | Describe all columns: errors, distinct, mode, mean, median, stdev, etc. |
| i | Add a column of incrementing numbers (useful for '.' charts) |
| . | Make a chart. Requires an "important" numeric column for row data. |
| O | Options to enable "numeric_binning" and set number of "histogram_bins" (use 'e') |
| F | Frequency table of row counts, or histogram if numeric_binning is true |

==== GnuPlot ====

GnuPlot is fine for generating text or image plots. Scatter plots, charts, and [[https://gnuplot.sourceforge.net/demo/histograms.html|histograms]] are easy.

Here's an example command, given the following two files, ''data.csv'' and ''gnuplot_instructions.gpi'':

  gnuplot -e "f='data.csv'" gnuplot_instructions.gpi

The data file, ''data.csv'':

  date,col1,col2
  1992-01-01,5,14
  1992-02-01,4,15
  1992-03-01,6,16
  1992-04-01,8,15.5
  1992-05-01,7,16

And the instructions file, ''gnuplot_instructions.gpi'':

  #
  # For ASCII on one full screen
  #
  set term dumb `tput cols` `tput lines`*9/10
  # Or, if you have gnuplot 6.0 and are using Deja Vu font, then...
  #set term block braille size `tput cols`,`tput lines`*9/10
  #
  # For a PNG file.
  #
  #   Launch in WSL:   explorer.exe gnuplot.png
  #   Launch in Linux: gio open gnuplot.png
  #
  #set term png size 1600,800; set output 'gnuplot.png'
  #
  # Set your X axis format
  #
  set timefmt '%Y-%m-%d'
  #set timefmt '%Y-%m-%d %H:%M'
  set xdata time
  set xlabel 'Date'
  set xtics "1992-01-01", 2629746 # start, increment in seconds (one average month)
  #set ylabel 'Value'
  #
  # If you need to control tics or range
  #
  #set ytic 2
  #set yrange [0:9]
  #
  # sep is whitespace by default.
  # Set to ',' for csv
  #
  set datafile sep ','
  set key autotitle columnhead # use the first line as title
  firstrow = system('head -1 '.f.' | tr "_," "- "')
  set xlabel word(firstrow, 1)
  set ylabel word(firstrow, 2)
  #
  # Plot which columns which ways.
  #
  #   lines, points, linespoints
  #
  #plot f using 1:4 with lines, f using 1:3 with linespoints
  #plot f using 1:2 with lines, f using 1:3 with linespoints title 'Legend 2'
  plot f using 1:2 with linespoints

If you're making a "histogram" (actually a box chart with histogram style on X,Y points):

  # Mostly the same as above, until...
  # Set your X axis format
  set style histogram clustered gap 1
  set style fill solid border -1
  # Finally, plot with boxes
  plot f using 1:2 with boxes

===== Python Tools =====

There's a hierarchy to the Python data analysis tools. You definitely need the most nested tools; you may not need the IDE and packager at the root. It goes something like this:

  * Anaconda Distribution: (Bigger snake than Python) packages Jupyter and everything you'd need.
    * Jupyter: The interactive notebook with embedded images.
      * Pandas: Read and sanitize data with read_csv(), to_datetime()
      * NumPy: Fundamental, the other projects rely on it.
      * Matplotlib: A plotting library for the Python programming language and its numerical mathematics extension NumPy.
      * Plotly: Generates interactive JavaScript plots

Here's [[https://github.com/Manishms18/Air-Passengers-Time-Series-Analysis/blob/master/Air_Passenger_with_explanations.ipynb|an example analysis of AirPassengers over time]].

==== Anaconda Distribution ====

**No get**. I think it's called Anaconda because that's literally a bigger snake than Python.

Anaconda is a distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment.
There's a [[https://www.anaconda.com/code-in-the-cloud|cloud-based trial with a Jupyter notebook]]. Anaconda Navigator lets you launch Jupyter, R, VS Code, etc.

==== Jupyter ====

**No get**. Jupyter Notebook is open-source software and services for interactive computing across multiple programming languages. Interactive embedded charts as you work seem nice. You may be able to try [[https://jupyter.org/try|the online version]].

==== Pandas ====

**Get this**. Read and sanitize data with read_csv(), to_datetime().

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Matplotlib works well with Pandas DataFrames (two-dimensional tabular data).

==== NumPy ====

**Get this**. NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, and random simulation.

==== Plotly ====

Undecided whether to use this. See [[https://towardsdatascience.com/matplotlib-vs-plotly-lets-decide-once-and-for-all-dc3eca9aa011|Matplotlib vs. Plotly: Let’s Decide Once and for All]]. Need to experiment.

plotly.py is an interactive, open-source, and JavaScript-based graphing library for Python. Built on top of plotly.js, plotly.py is a high-level, declarative charting library that includes over 30 chart types, including scientific charts, 3D graphs, statistical charts, SVG maps, and financial charts.

==== Matplotlib ====

**Get this**.
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.

====== Case Study: Temporal Series ======

Data: [[https://www.kaggle.com/datasets/rakannimer/air-passengers|AirPassengers.csv]]

===== VisiData =====

  vd AirPassengers.csv

^ Key ^ Action ^
| @ | Set column one as date format |
| ! | Set column one as "important" for the X-axis |
| l | Navigate to column 2 |
| # | Set data as integer format |
| . | Create scatterplot |

{{:general:visidata_airpassengers.png?400|}}

**Pros**: Super fast and easy.

**Cons**: Need to use a font where Braille is supported. It's a scatterplot without lines.

===== GnuPlot =====

  # For ASCII on one full screen
  #set term dumb `tput cols` `tput lines`*9/10
  # If you have gnuplot 6.0 and are using Deja Vu font, then...
  #set term block braille size `tput cols`,`tput lines`*9/10
  # For a PNG file.
  set term png size 900,400; set output 'AirPassengers.png'
  set timefmt '%Y-%m'
  set xdata time
  set format x '%Y'
  set key autotitle columnhead # use the first line for titles in legend
  set xlabel 'Year' # except for X, where we show Year not Month
  set datafile sep ','
  # You can use: lines, points, linespoints
  plot 'AirPassengers.csv' using 1:2 with lines

Save the instructions above as ''AirPassengers.gpi'', then run them (this example opens the PNG from WSL):

  gnuplot AirPassengers.gpi && explorer.exe AirPassengers.png

{{:general:airpassengers.png?400|}}

When you change ''term'' to ''dumb'', you get an ASCII rendering sized to your terminal: the same ''#Passengers'' line drawn with ''*'' characters, with the Y axis running from 100 to 700 and the X axis labeled Year from 1949 to 1961.

**Pros**: Fast and easy. Renders to text or PNG pretty easily. Sometimes better text renderings than VisiData.

**Cons**: Not that pretty without customizations. GPI file takes some tweaking.

===== Matplotlib =====

  import pandas as pd
  import matplotlib.pyplot as plt
  # Read the CSV and parse the Month column (e.g. "1949-01") as dates
  data = pd.read_csv('AirPassengers.csv')
  data['Month'] = pd.to_datetime(data['Month'])
  data = data.set_index(['Month'])
  # Plot passengers over time
  plt.figure(figsize=(10,5))
  plt.xlabel("Year")
  plt.ylabel("Airline Passengers")
  plt.plot(data)
  plt.show()

{{:general:matplotlib_airpassengers.png?400|}}

**Pros**: There's [[https://github.com/Manishms18/Air-Passengers-Time-Series-Analysis/blob/master/Air_Passenger_with_explanations.ipynb|so much more you can do]].

**Cons**: Heavyweight.
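
As one small example of the "so much more", here is a minimal sketch of averaging the monthly counts per year with Pandas before plotting. It assumes the ''Month,#Passengers'' column layout of AirPassengers.csv. The ''yearly_mean'' helper and the ''SAMPLE_CSV'' rows are made up for illustration and are not part of any library or of the linked notebook.

```python
import io

import pandas as pd

# Sample rows in the same "Month,#Passengers" layout as AirPassengers.csv.
# Values are for illustration only; point read_csv at the real file instead.
SAMPLE_CSV = """Month,#Passengers
1949-01,112
1949-02,118
1949-03,132
1950-01,115
1950-02,126
1950-03,141
"""


def yearly_mean(csv_source):
    """Return the mean passenger count per calendar year as a Series.

    Hypothetical helper: csv_source is a path or file-like object.
    """
    data = pd.read_csv(csv_source)
    # Parse "YYYY-MM" strings into timestamps and index the rows by them
    data["Month"] = pd.to_datetime(data["Month"], format="%Y-%m")
    data = data.set_index("Month")
    # Group on the year of each timestamp and average the data column
    return data["#Passengers"].groupby(data.index.year).mean()


if __name__ == "__main__":
    print(yearly_mean(io.StringIO(SAMPLE_CSV)))
```

With the real file, ''yearly_mean('AirPassengers.csv')'' should return one value per year, which can be passed to ''plt.plot()'' in place of ''data'' to draw a smoother trend line.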