Pandas Profiling – Turbocharge your Data Analysis

Published on October 14th, 2019

Pandas Profiling is the cool new tool for Data Analysis with Python. In this post, we will analyse the Airbnb data for Amsterdam, kindly provided by Inside Airbnb.

If you are running Anaconda, you can install the package by running the following command in your command prompt (Windows) or terminal (macOS / Linux):

conda install -c conda-forge pandas-profiling

Alternatively, you can use “pip”:

pip install pandas-profiling

For more information, check out the documentation.

The Python Code

The code could hardly be simpler. Once you have created your dataframe, you only need to call “df.profile_report()” to generate your report.

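import pandas as pd
import pandas_profiling  # registers profile_report() on DataFrames

# Listings for Amsterdam, as published by Inside Airbnb
url = ('http://data.insideairbnb.com/'
       'the-netherlands/north-holland/amsterdam/'
       '2019-08-08/visualisations/listings.csv')

df = pd.read_csv(url)

# In a Jupyter notebook, this renders the report inline
df.profile_report()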

The code is also available on GitHub, for your convenience.

Pandas Profiling’s Results

The report consists of 5 sections:

      1. Overview
      2. Variables
      3. Correlations
      4. Missing values
      5. Samples

The first part, the overview, already gives a lot of insight into the dataset. In this example, we see there are 16 columns (variables) and ~20k rows (observations). There are columns we expect for Airbnb data, like price, number of reviews and minimum nights. The report immediately shows some important warnings. For example, you can see which columns have missing data or contain a lot of zeros. You can even see which columns are highly skewed. Finally, the column “neighbourhood_group” is rejected, since it contains no values at all.

Pandas Profiling - Overview
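To get a feel for what these warnings mean, here is a minimal sketch of similar checks in plain pandas. It is not how Pandas Profiling computes them internally, and the tiny dataframe below is made up for illustration; only the column names follow the listings file.

```python
import pandas as pd

# Toy stand-in for the listings data; the values are made up for illustration.
df = pd.DataFrame({
    "price": [0, 80, 120, 95, 8915],
    "number_of_reviews": [0, 0, 12, 3, 0],
    "neighbourhood_group": [float("nan")] * 5,
})

# Number of observations (rows) and variables (columns)
n_rows, n_cols = df.shape

# Share of zeros per column
zero_share = (df == 0).mean()

# Columns that never contain a value
empty_columns = [col for col in df.columns if df[col].isna().all()]

# Skewness of the numeric columns; a large positive value means a long right tail
skewness = df[["price", "number_of_reviews"]].skew()

print(n_rows, n_cols)
print(zero_share)
print(empty_columns)
print(skewness)
```

On the real dataset, you would run the same checks on the dataframe loaded with “pd.read_csv(url)”.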

Variables

In the report, you can scroll down to take a closer look at the variables. Amsterdam has 22 distinct neighbourhoods. The neighbourhood “De Baarsjes” has the most listings, around 3.5k. 

Airbnb - Neighbourhoods in Amsterdam
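If you want these numbers outside the report, a sketch along the following lines should do. It assumes the column is called “neighbourhood”, and the sample rows are made up, so the counts do not match the real data.

```python
import pandas as pd

# Made-up sample rows; the column name "neighbourhood" is an assumption.
listings = pd.DataFrame({
    "neighbourhood": [
        "De Baarsjes", "Centrum-West", "De Baarsjes",
        "Oost", "De Baarsjes", "Oost",
    ]
})

# Number of distinct neighbourhoods
n_distinct = listings["neighbourhood"].nunique()

# Listings per neighbourhood, most frequent first
counts = listings["neighbourhood"].value_counts()

# Neighbourhood with the most listings
top_neighbourhood = counts.idxmax()

print(n_distinct)
print(counts)
print(top_neighbourhood)
```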

Prices vary widely, from €0 to €8,915 per night. Free nights seem unlikely, so you might want to clean up this variable. Furthermore, what would you do with the extreme values, the outliers?

Airbnb Price in Amsterdam
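One possible way to approach the clean-up, sketched on made-up prices: drop the free nights, then flag outliers with the common 1.5 × IQR rule of thumb. The rule is just one option among many and is an assumption here, not something the report prescribes.

```python
import pandas as pd

# Made-up nightly prices in euros, including a free night and an extreme value.
prices = pd.Series([0, 60, 80, 95, 110, 120, 150, 8915], name="price")

# Drop free nights, which are unlikely to be genuine prices.
paid = prices[prices > 0]

# Flag values more than 1.5 * IQR below the first or above the third quartile.
q1, q3 = paid.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (paid < q1 - 1.5 * iqr) | (paid > q3 + 1.5 * iqr)
outliers = paid[is_outlier]

print(outliers)
```

Whether you then remove, cap or simply keep the flagged listings depends on the question you want to answer.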

Missing Values

Next in the report, you can find an overview of the missing values. There are a couple of thousand listings without reviews. Otherwise, the dataset looks complete.

Pandas Profiling - Missing Values
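You can double-check this kind of finding with a few lines of pandas. The sketch below uses made-up rows and assumes a column named “reviews_per_month” that is empty whenever a listing has no reviews; check the column names in your own copy of the data.

```python
import pandas as pd

# Made-up rows; "reviews_per_month" is an assumed column name.
df = pd.DataFrame({
    "number_of_reviews": [4, 0, 12, 0, 1],
    "reviews_per_month": [0.5, None, 1.2, None, 0.1],
})

# Number of missing values per column
missing_per_column = df.isna().sum()

# Are the missing values exactly the listings without reviews?
no_reviews = df["number_of_reviews"] == 0
explained = (df["reviews_per_month"].isna() == no_reviews).all()

print(missing_per_column)
print(explained)
```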

Correlations

Lastly, the report contains a nice and tidy overview of the correlation between the variables. Positive correlations are coded in red, negative in blue. In this dataset, there seems to be a small negative correlation between “price” and “number of reviews”. Intuitively, this makes sense, because cheaper rooms probably get rented more often and hence get more reviews.

Correlations
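To compute correlations yourself, pandas offers “DataFrame.corr()”. The toy data below is made up and deliberately constructed so that pricier listings have fewer reviews, so the effect is much stronger than the small correlation in the real report.

```python
import pandas as pd

# Made-up data in which pricier listings have fewer reviews.
df = pd.DataFrame({
    "price": [50, 75, 100, 150, 300],
    "number_of_reviews": [40, 35, 20, 10, 2],
})

# Pearson measures linear association, Spearman measures rank (monotonic) association.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

r = pearson.loc["price", "number_of_reviews"]
rho = spearman.loc["price", "number_of_reviews"]

print(r)
print(rho)
```

Keep in mind that a correlation only hints at a relationship. It does not tell you that lower prices cause more reviews.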

Conclusion – Pandas Profiling rocks!

The ‘Pandas Profiling’ package is a powerful tool for data analysis. With just a few lines of code, you get a very comprehensive report about the dataset. That way, you can quickly get started on the fun stuff: working on your Python code.

Finally, if you would like to learn more about Python, feel free to check out the Workshops in Amsterdam.

Happy Coding!
