whisk documentation home¶
whisk is an open-source data science project framework that makes collaboration, reproducibility, and deployment “just work”. It combines a data science-flavored Python project structure with a suite of lightweight tools. whisk lets you focus on data science but adds just enough structure to make the project easy to share.
Whisk doesn’t lock you into a particular ML framework or require you to work in a predefined way. Instead, it lets you leverage the large Python ecosystem by structuring your data science project in a Pythonic way. Whisk does the structuring while you focus on the data science.
Read more about our beliefs.
Getting Started¶
Start by creating a project. Open a terminal session and run the commands below. Note: we use demo as the project name in the examples below. If you use a different project name, be sure to replace demo with the name of your project.
$ pip install whisk
$ whisk create demo
$ cd demo
$ source venv/bin/activate
The commands above do the following:
Install the whisk package
Create a project named “demo”
Change into the project directory
Activate the project’s venv
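With the venv active, you can check that the project runs. The demo.models.model module and Model class below are assumptions based on the sample model whisk scaffolds into a new project; adjust the import path to match your project name:
$ python -c "from demo.models.model import Model; print(Model().predict([[1]]))"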
To try out all of the features, continue the quick tour of whisk →.
Examples¶
The whisk-ml GitHub org contains example whisk projects. Check out these examples and clone them locally. Since whisk makes reproducibility “just work”, in most cases you simply need to run whisk setup to use the models these projects generate. Here are a few examples to start with:
Text Classification with Keras and Tensorflow - A model that predicts which tweets are about real disasters and which ones are not. This project uses DVC to version control the data download and training stages.
Image Classification with Tensorflow - A classifier that determines whether an image is of a mountain bike or a road bike.
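Because every example follows the same project conventions, reproducing one locally is typically just a clone followed by whisk setup. The repository name below is a placeholder - substitute the example project you want to try:
$ git clone https://github.com/whisk-ml/<example-project>.git
$ cd <example-project>
$ whisk setup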
Documentation¶
Beliefs¶
whisk is not for everyone. However, if the beliefs below resonate with you, it might make sense for your next DS project:
A reproducible, collaborative project is a solved problem for classical software - We don’t need to re-invent the wheel for machine learning projects. Instead, we need guide rails to help data scientists structure projects without forcing them to also become software engineers.
A notebook is great for exploring, but not for production - A data science notebook is where experimentation starts, but you can’t create a reproducible, collaborative project with just an *.ipynb file.
Optimize for debugging - 90% of writing software is fixing bugs. It should be fast and easy to debug your model logic locally. You should be able to search for your error and find results, not sift through custom package source code or stop and restart Docker containers.
Python already has a good package manager - We don’t need overly abstracted solutions to package a trained ML model. A properly structured ML project lets you distribute the model via pip, making it easy for anyone to benefit from your work.
Version control is a requirement - You can’t have a reproducible project if the code and training data aren’t in version control.
Docker is an unsteady foundation - When we explicitly declare and isolate dependencies, we don’t need to rely on the implicit existence of packages installed in a Docker container. Python has solid native tools for dependency management.
Kubernetes is overkill - Very few web applications require the complexity of a container-orchestration system. Your deployed model is no different. Most models can run on boring, reliable technology.