Pandas Profiling is the cool new tool for Data Analysis with Python. In this post, we will analyse the Airbnb data for Amsterdam, kindly provided by Inside Airbnb.
If you are running Anaconda, you can install the package with the following command in your command line (Windows) or terminal (MacOS / Linux):
conda install -c conda-forge pandas-profiling
Alternatively, you can use “pip”:
pip install pandas-profiling
For more information, check out the documentation.
The Python Code
The code could hardly be simpler. Once you have created your dataframe, you only need to call “df.profile_report()” to generate the report.
import pandas as pd
import pandas_profiling  # importing this adds profile_report() to dataframes

url = ('http://data.insideairbnb.com/'
       'the-netherlands/north-holland/amsterdam/'
       '2019-08-08/visualisations/listings.csv')

df = pd.read_csv(url)
df.profile_report()
The code is also available on Github, for your convenience.
Pandas Profiling’s Results
The report consists of 5 sections:
- Overview
- Variables
- Correlations
- Missing values
- Samples
The first part, the overview, already gives a lot of insight into the dataset. In this example, we see there are 16 columns (variables) and ~20k rows (observations). There are columns we expect for Airbnb data, like price, number of reviews and minimum nights. The report immediately shows some important warnings. For example, you can see which columns are missing data or contain a lot of zeros. You can even see which columns are highly skewed. Finally, the column “neighbourhood_group” is rejected, since it contains no values at all.
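If you want to double-check a few of these warnings by hand, plain pandas is enough. The snippet below is only a rough sketch: the small dataframe and its values are made up for illustration, and the column names are assumed to match the Inside Airbnb listings file. It is not how Pandas Profiling computes its warnings internally.

```python
import pandas as pd

# Made-up sample data; column names are assumed to match the listings file.
sample = pd.DataFrame({
    "price": [0, 80, 95, 120, 8915],
    "number_of_reviews": [0, 0, 12, 3, 0],
    "neighbourhood_group": [None, None, None, None, None],
})

# Number of observations (rows) and variables (columns)
n_rows, n_cols = sample.shape

# Share of zeros per numeric column
zero_share = (sample[["price", "number_of_reviews"]] == 0).mean()

# Skewness of the price column (positive means a long right tail)
price_skew = sample["price"].skew()

# Columns that never contain a value
empty_columns = [col for col in sample.columns if sample[col].isna().all()]

print(n_rows, n_cols)
print(zero_share)
print(price_skew)
print(empty_columns)
```

On the real data you would run the same calls on “df” instead of the sample.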
Variables
In the report, you can scroll down to take a closer look at the variables. Amsterdam has 22 distinct neighbourhoods. The neighbourhood “De Baarsjes” has the most listings, around 3.5k.
Neighbourhood
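The same numbers can be pulled straight from the dataframe. A minimal sketch, assuming the column is called “neighbourhood” as in the listings file; the sample rows and the placeholder neighbourhood names are made up:

```python
import pandas as pd

# Made-up sample; "Neighbourhood B" and "Neighbourhood C" are placeholders.
sample = pd.DataFrame({
    "neighbourhood": [
        "De Baarsjes", "Neighbourhood B", "De Baarsjes",
        "Neighbourhood C", "De Baarsjes", "Neighbourhood B",
    ],
})

# Number of distinct neighbourhoods
n_distinct = sample["neighbourhood"].nunique()

# Listings per neighbourhood, most frequent first
listings_per_neighbourhood = sample["neighbourhood"].value_counts()

# Neighbourhood with the most listings
top_neighbourhood = listings_per_neighbourhood.idxmax()

print(n_distinct)
print(listings_per_neighbourhood)
print(top_neighbourhood)
```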
Prices vary widely, from €0 to €8,915 per night. Free nights seem unlikely, so you might want to clean up this variable. Furthermore, what would you do with the extreme values, the outliers?
price
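One possible way to handle both issues is sketched below. It drops the zero prices and flags outliers with the 1.5 × IQR rule. That rule is just one common convention, chosen here as an assumption; the report itself does not prescribe any cleaning method, and the prices in the sample are made up.

```python
import pandas as pd

# Made-up prices, including a free night and one extreme value
prices = pd.Series([0, 60, 80, 95, 120, 150, 8915], name="price")

# Treat a price of 0 as invalid and drop it
valid = prices[prices > 0]

# Upper fence of the 1.5 * IQR rule
q1 = valid.quantile(0.25)
q3 = valid.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Split into outliers and the remaining, cleaned prices
outliers = valid[valid > upper_fence]
cleaned = valid[valid <= upper_fence]

print(outliers.tolist())
print(cleaned.tolist())
```

Whether you drop, cap or keep the outliers depends on the question you want to answer, so inspect them before removing anything.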
Missing Values
Next, the report gives an overview of the missing values. A couple of thousand listings have no reviews. Apart from that, the dataset looks complete.
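For a quick count outside the report, you can ask pandas directly. A small sketch with made-up rows; the column names “last_review” and “reviews_per_month” are assumed to be the review columns in the listings file:

```python
import pandas as pd

# Made-up sample; listings without reviews have no review date or rate.
sample = pd.DataFrame({
    "price": [80, 95, 120, 150],
    "number_of_reviews": [12, 0, 3, 0],
    "last_review": ["2019-07-30", None, "2019-06-14", None],
    "reviews_per_month": [0.9, None, 0.2, None],
})

# Number and share of missing values per column
missing_counts = sample.isna().sum()
missing_share = sample.isna().mean()

# Show only the columns that actually miss data
print(missing_counts[missing_counts > 0])
print(missing_share[missing_share > 0])
```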
Correlations
Lastly, the report contains a nice and tidy overview of the correlations between the variables. Positive correlations are coded in red, negative ones in blue. In this dataset, there seems to be a small negative correlation between “price” and “number of reviews”. Intuitively, this makes sense, because cheaper rooms probably get rented more often and hence get more reviews.
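To check a single pair of variables yourself, pandas can compute the coefficient directly. The sketch below uses made-up numbers and the Pearson coefficient, which is the pandas default. Which coefficients the report shows may depend on your version of the package, so treat this as an illustration only.

```python
import pandas as pd

# Made-up sample in which cheaper listings have more reviews
sample = pd.DataFrame({
    "price": [60, 80, 95, 120, 150],
    "number_of_reviews": [40, 25, 30, 10, 5],
})

# Pearson correlation between the two columns
correlation = sample["price"].corr(sample["number_of_reviews"])

# Full correlation matrix of all numeric columns
correlation_matrix = sample.corr()

print(correlation)
print(correlation_matrix)
```

A value below zero means that listings with a higher price tend to have fewer reviews in this sample.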
Conclusion – Pandas Profiling rocks!
The ‘Pandas Profiling’ package is a powerful tool for data analysis. With just a few lines of code, you get a very comprehensive report about the dataset, so you can quickly get started with the fun stuff: working on your Python code.
Finally, if you would like to learn more about Python, feel free to check out the Workshops in Amsterdam.
Happy Coding!