User Tools

Site Tools



VisiData is an interactive CLI table explorer with vim key bindings. Here is a really great tutorial. These are notes to myself.

Source Table Recommendations

Gather data with more granularity than you think you need. You often need three tables:

  1. Logs: Include date, build (includes platform ID), product ID, params of log, total count, unique devices.
  2. Device Activity: Includes date, build, product ID, active time, active device count.
  3. Mappings from product ID to product name: (Maybe from platform ID or from product ID.)

Open VisiData and Set Data Types

$ vd logs.csv active_devices.csv id_name.csv

Set Column Types, Widths, Names, Importance

Key Meaning
_ Adjust widths of columns
@ Column type is date.
! Column is important. (Will be key for table merging.)
# Column is integer. (Set aggregator to sum.)
+ Set aggregator, usually to “sum”
; Extract regex to new column. Ex, (video|audio), (^..) or (^([STUVW]...|..))
^ rename the column. Might have to be “product_id” or “platform_id”
= Use Python function to create new column. Ex, hex to dec: int(curcol,16)
: Split column by regex
- Hide column
S Go to “Sheets” sheet, to select another sheet to format.
q Leave the sheet. (gq quit all sheets)
F Make a Frequency Table for the selected column. (gF for selected columns)
U Undo. R Redo.
d Delete row. (I only use this on Sheets sheet.)
, Select all rows that match this column's value
Open duplicate sheet with only selected rows

Inspecting Columnar Data

Key Meaning
I Describe all columns, errors, distinct, mode, mean, median, stdev, etc.
i Add a column of incrementing numbers (useful for '.' charts)
. Requires an “important” numeric column for row data. Make chart.
O Options to enable “numeric_binning” and set number of “histogram_bins” (use 'e')
F Frequency table of row counts, or histogram if numeric_binning is true

Calculating a percentage-of-total column for a numeric column:

Key Meaning
# Set column type to “int”.
I Describe all columns. (Highlight the “sum” cell for the column of interest.)
~ Convert that column to “text” so it'll be copied correctly for pasting later.
zy Yank the value of the sum.
q Quit the Describe sheet.
= New column. Enter curcol/, use Ctrl+y to paste the column sum value.

Case Study Link: Exported CSV from PG&E

Cast Study: Merging Two Tables, logs and metadata

Protip: Use column view to set multiple columns at once

Key Meaning
C Go to column view, each row is data for a column.
t Select rows, select all integer rows.
ge Go to “type” column, enter ge (global edit), type “int”
Go to “aggregators”, enter ge, type “sum”
q Leave that table.

For example, your “logs.csv” would end up looking like:

Your active_devices.csv would look like this:

Join Sheets

Use ! to set Key columns. Use F (or gF) to remove dupes of Keys by making a Frequency Table.

Key Meaning
The key columns to join have to match names. Check and rename with ^.
S Go to Sheets sheet
t (toggle) Select the sheets you want to join by key columns.
& (or g&) Join sheets. Type “inner” for an inner join.
gu Next time you're in Sheets, “global unselect” to unselect the sheets you selected.

Protip: Remove insignificant noisy rows

Key Meaning
z| Select rows matching Python expression. Type “blits > 10000”
Make a new sheet with selected rows

And the merge of the logs and active devices sheets would look like:

Do a similar thing to join the Joined Sheet with the product Name sheet.

Process Data

In your final Joined sheet, add rows that are calculations of other rows.

Key Meaning
HL (JK for rows) Move columns
- Remove columns
Maybe use C to assure desired columns are ints and aggregators are “sum”
! Toggle key columns.
gF global Frequency Table so key columns are unique.
= Add column. Enter count_sum * 100 / active_devices_sum
% Set new column as float.
_ Resize columns
] Sort by column
. (g. for multiple columns) Make a graph based on Key column.

Now you should have some basic info to keep exploring. When you want to save what you've done:

Key Meaning
Ctrl+s Save sheet
Ctrl+d Save command log (cmdlog). Use extension “.vd”. You can replay what you've done later!

The commandlog is a CSV file too. The first lines are the open file commands. So to replay, you only have to specify the command log like so:

$ vd --play=my_cmdlog.vd --replay-wait=0.5

Lists in Cells for Frequency Tables

Sometimes you want one of the columns in a Frequency Table to be a list of unique values. Let's say the column title is “my_column”, then:

Key Meaning
+ Set the aggregator to “List”
F Make a Frequency Table for the selected column. (gF for selected columns)
=, ','.join(set(my_column)) Create a new column of a comma delimited Python Set of cell entries.
vd.txt · Last modified: 2024/05/13 11:23 by dblume