data chimp

data chimp is a programmable data analysis assistant that automatically shows contextual data visualizations, tables, and data docs as you work in your Jupyter notebook. Use it to help you spot unexpected features in your data, get oriented in a new data set quickly, or to enforce best practices on your team.

quick start

Connect to an existing Jupyter kernel by running the %connect_info magic, then clicking the connect icon in the cell that contains it:

Once connected, click the "scatterplots" button within the data chimp view:

Then run the following code:

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

You should see a scatterplot for every combination of numeric columns within the titanic data set:

You can get the code that generated a particular visualization by pressing the Send Code button:

data chimp is configured via Jupyter notebooks that live in the data_chimp directory within your workspace. Check out data_chimp/default.ipynb for a quick overview of the default configuration and to get a sense of how to modify it to suit your needs.

pro tip: move the data chimp view to your secondary side bar so you can see data chimp results AND any other sidebar view:

features

Quickly visualize your data with code-aware visualizations

The notebooks you keep in the data_chimp directory are turned into collections of auto-executing code chunks that you can enable at the press of a button within the data chimp view. These chunks can reference the data frame you are currently working with in your Jupyter notebook (via the special df variable), and they can show data visualizations, tables, or even messages alongside your analysis. It's also easy to get the code that generated a particular automatic result so you can iterate on it if needed. For an example, check out the quick start.
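
To make the idea of a chunk concrete, here is a minimal sketch of what a "scatterplots" style chunk could do with df. It is not the code data chimp ships; the helper names numeric_column_pairs and scatter_all_pairs are hypothetical, and the small data frame only stands in for whatever you are working with.

```python
from itertools import combinations

import pandas as pd


def numeric_column_pairs(df: pd.DataFrame) -> list:
    """Return every unordered pair of numeric column names in df."""
    numeric_cols = df.select_dtypes(include="number").columns
    return list(combinations(numeric_cols, 2))


def scatter_all_pairs(df: pd.DataFrame) -> None:
    """Draw one scatterplot per pair of numeric columns (needs matplotlib)."""
    for x, y in numeric_column_pairs(df):
        df.plot.scatter(x=x, y=y, title=f"{y} vs. {x}")


# Stand-in for the data frame you are working with in your notebook.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0],
        "Fare": [7.25, 71.28, 7.92],
        "Pclass": [3, 1, 3],
        "Name": ["Braund", "Cumings", "Heikkinen"],
    }
)
pairs = numeric_column_pairs(df)
```

In a notebook you would call scatter_all_pairs(df) to render the plots; the non-numeric Name column is skipped automatically.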

Automatically check for data quality or analysis issues

The cells of the default.ipynb notebook within the data_chimp directory are always executed while you're working with data frames, so you can use them to check for data quality or analysis issues automatically. The default configuration shows a table of the columns with more than 3% missing values, along with the percentage missing in each, and this functionality is enabled with a simple cell:

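# Percentage of missing values per column, highest first.
missing_df = (df
  .isnull()
  .mean()
  .round(4)
  .mul(100)
  .sort_values(ascending=False)
)
# Keep only the columns with more than 3% missing values.
badly_missing_df = missing_df[missing_df > 3]
# Show the result only when at least one column qualifies.
badly_missing_df if not badly_missing_df.empty else None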

You can see this cell at work in the titanic data set:

Once you've installed data chimp, you can check out data_chimp/default.ipynb for more info.
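
As a sketch of how a further check could be written in the same style, the cell below flags columns that hold at most one distinct value. This check is hypothetical and not part of the default configuration. It assumes, as the missing-values cell suggests, that a cell evaluating to None shows nothing; the result variable is used here only so the value can be inspected outside a notebook.

```python
import pandas as pd

# Stand-in for the data frame data chimp would supply as `df`.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "country": ["US", "US", "US", "US"],
        "score": [0.5, None, 0.7, 0.9],
    }
)

# Count distinct non-missing values per column.
unique_counts = df.nunique()
# Columns with at most one distinct value carry no information.
constant_columns = unique_counts[unique_counts <= 1]
# As in the default cell, evaluate to None when there is nothing to report.
result = constant_columns if not constant_columns.empty else None
```

In an actual configuration cell, the last line would be the bare expression rather than an assignment.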

Loop previous results for feedback as you wrangle your data

data chimp adds a loop button to each notebook cell toolbar. Once this button is pressed, the cell will automatically run each time you execute another cell, but the data frame referenced in the looped cell will be replaced with the data frame you're currently working with in the current cell.

For example, imagine you've found a typo in some penguins data by running df['species'].value_counts():

df['species'].value_counts()

Gentoo        124
Adelie        114
Chinstrap      46
Adlie          38
Chinstrapp     22
Name: species, dtype: int64

As you fix these typos, you want this table to update so you can get feedback on the correctness of your string replace code. So, you can loop this result, and as you run your string replace code, it'll update:
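
The string replace code in question might look like the following sketch. The data frame is rebuilt here from the counts listed above purely for illustration, and the typo_fixes mapping is an assumption about which spellings are intended.

```python
import pandas as pd

# Rebuild a species column with the counts shown above (illustrative data).
counts = {"Gentoo": 124, "Adelie": 114, "Chinstrap": 46, "Adlie": 38, "Chinstrapp": 22}
df = pd.DataFrame({"species": [name for name, n in counts.items() for _ in range(n)]})

# Map each misspelling to its correct spelling.
typo_fixes = {"Adlie": "Adelie", "Chinstrapp": "Chinstrap"}
df["species"] = df["species"].replace(typo_fixes)

# The looped value_counts() cell would now show three species instead of five.
fixed_counts = df["species"].value_counts()
```

With the looped cell in place, each partial fix shows up in the table immediately, so a misspelling you missed stays visible until it is handled.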

Pull data docs into your notebook

With data chimp, it's easy to pull your data catalogue docs into your notebook. A notebook that does this for dbt docs is already available, and you can adapt it to pull in docs from your own data catalogue in just a few minutes. All you need to do is:

1. Download the notebook to your data_chimp directory.
2. (Optionally) tweak the notebook to pull data from an alternative data catalogue.
3. Fill in the configuration params so it runs.
4. Enable it in the UI like this:
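
To give a feel for what such a notebook might contain, here is a rough sketch that looks up column descriptions for the current data frame in a dbt manifest. It is not the notebook referred to above. The column_docs helper and the model name are hypothetical, and the manifest layout (nodes, each with a name and a columns mapping carrying a description) is an assumption you should check against the manifest.json your dbt version produces.

```python
import pandas as pd


def column_docs(manifest: dict, model_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """Return the documented description for each column of df, if any."""
    # Find the first node in the manifest whose name matches the model.
    node = next(
        (n for n in manifest.get("nodes", {}).values() if n.get("name") == model_name),
        None,
    )
    documented = node.get("columns", {}) if node else {}
    return pd.DataFrame(
        {
            "column": list(df.columns),
            "description": [
                documented.get(col, {}).get("description", "") for col in df.columns
            ],
        }
    )


# Tiny hand-written stand-in for a manifest (normally loaded with json.load).
manifest = {
    "nodes": {
        "model.shop.orders": {
            "name": "orders",
            "columns": {
                "order_id": {"name": "order_id", "description": "Primary key."},
                "amount": {"name": "amount", "description": "Order total in USD."},
            },
        }
    }
}
df = pd.DataFrame({"order_id": [1], "amount": [9.5], "status": ["paid"]})
docs = column_docs(manifest, "orders", df)
```

Undocumented columns, such as status here, get an empty description rather than raising an error, which keeps the chunk safe to run against any data frame.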

Requirements

Make sure you've installed the VS Code Jupyter extension before using data chimp.

Release Notes

0.0.1

Public beta launch