Cleaning Up Debt: A Python pandas Approach
This is a cross-post by tryb Product Development intern Finn Qiao. Finn is a graduate student at the McCormick School of Engineering, Northwestern University, where he uses Python and pandas for data science and analytics.
Data Wrangling and Exploratory Data Analysis of Non-Performing Loan (NPL) Data
A recent analysis of Southeast Asian credit markets at work brought to light some nuances of credit culture in this particular region. The data got me thinking about credit health and non-performing loan rates around the world. Given my recent foray into data analysis, I thought a notebook was in order.
But first, what is the credit market and what are non-performing loans?
By examining the credit market, we are looking at the market in which individuals, corporations, and governments seek to borrow money from investors. This borrowing can take on many different forms, from your daily credit card expenses to the mortgage on your next beach home.
Non-performing loans (NPL) refer to loans in which the borrower has not been able to make scheduled repayments for some time (usually over 90 days).
When NPL levels rise, it is often an indicator of deteriorating credit market health. In short, more people are not paying back what they owe.
We’ll examine NPL levels across different countries from the World Bank Databank.
First, we start by importing the necessary packages…
As there are some rows of filler column information at the top of the file, I skip the first 4 rows with `islice`.
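A minimal sketch of that setup, assuming the World Bank export is saved locally as npl_data.csv (a hypothetical filename):

```python
from itertools import islice
import csv

import numpy as np
import pandas as pd
import missingno as msno

# Hypothetical filename for the World Bank NPL export.
with open('npl_data.csv') as f:
    reader = csv.reader(f)
    rows = list(islice(reader, 4, None))  # skip the 4 filler rows

df = pd.DataFrame(rows)
```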
Missing data
A look at the `dataframe` suggests there is a lot of missing data, especially for earlier years. To get a better picture of which data is typically missing, we use `missingno` for a quick visualization.
Before using `missingno`, however, it appears that empty cells are represented as plain strings rather than proper missing values. We therefore replace all such instances with `np.nan` so that the `missingno` visualization works.
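A sketch of that replacement, assuming the empty cells come through as empty strings:

```python
# Empty cells are assumed to arrive as '' rather than true missing values;
# normalize them to np.nan so missingno can detect them.
df = df.replace('', np.nan)
msno.matrix(df)
```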
Any white space indicates missing data
The visualization shows large blocks of missing data, especially in the annual data up to 2010. 2016 also has more missing data than the six years prior. Nations missing data for the 2010–2015 period are thus dropped.
Furthermore, the columns for ‘Indicator name’ and ‘Indicator code’ don’t seem particularly useful.
We’ll create a new `dataframe` with only the country, country code, and data for the years 2010–2015. We will also set the column names, which currently reside in the first row.
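A sketch of that step, assuming the standard World Bank header names (‘Country Name’, ‘Country Code’, and the year columns):

```python
# Promote the first row to column headers, then drop it from the data.
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)

# Keep only the identifying columns and the 2010-2015 data,
# dropping countries with gaps in that window.
year_cols = ['2010', '2011', '2012', '2013', '2014', '2015']
npl = df[['Country Name', 'Country Code'] + year_cols].dropna()
```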
No missing fields now
Data types
Much better. That leaves 144 countries/regions with values in every column. Some rows might not be country-specific, but first there is the issue of data types.
Examining the datatypes of the columns, all the data appear to be non-null objects. Since the NPL levels represent percentages, these columns should be of type `float`.
After changing the data types to `float`, the `describe()` function works again, and the mean values stay within a relatively narrow range. There is, however, a peculiar outlier: the minimum NPL for 2010 and 2011 is 0%. Can NPL really be at 0%? Seems like a fantasy world.
Tonga is the odd one out, with its NPL jumping to 14% in 2012. Perhaps it only opened up its credit markets in 2012? A quick Google search debunks that claim. It is likely an erroneous entry, so we will drop Tonga from the analysis.
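A sketch of the inspection and the drop:

```python
# Inspect the rows with suspicious 0% entries, then drop Tonga.
print(npl[(npl['2010'] == 0) | (npl['2011'] == 0)])
npl = npl[npl['Country Name'] != 'Tonga']
```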
‘Excess’ data
A look at the ‘Country Name’ column reveals 143 ‘countries’, but many of them refer to regions and classifications like ‘South Asia (IDA & IBRD)’. We only want pure country data, so we’ll import a list of known country names from `country_list` and filter against it.
There are many ways to go about cleaning up the country list. More elegant methods involve finding near matches and filtering by match score; OUseful, for example, gives a good demonstration of using `fuzzyset` to find near matches.
However, given that this is a relatively small set of index values, we could simply look at the countries that were filtered out when cross referenced with the country list.
Of the 44 values removed, the country names ‘Bosnia and Herzegovina’, ‘Brunei Darussalam’, ‘Congo, Rep.’, ‘Czech Republic’, ‘Gambia, The’, ‘Hong Kong SAR, China’, ‘Kyrgyz Republic’, ‘Macao SAR, China’, ‘Macedonia, FYR’, ‘Slovak Republic’, and ‘Trinidad and Tobago’ should have remained.
We then use a list comprehension to arrive at our final list of filtered countries.
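A sketch of the cross-reference, assuming the `country_list` package’s `countries_for_language` helper, with the misfiltered names added back manually:

```python
from country_list import countries_for_language

# countries_for_language returns (ISO code, name) pairs for a locale.
known = {name for _, name in countries_for_language('en')}

# World Bank spellings that the cross-reference wrongly filtered out.
exceptions = {'Bosnia and Herzegovina', 'Brunei Darussalam', 'Congo, Rep.',
              'Czech Republic', 'Gambia, The', 'Hong Kong SAR, China',
              'Kyrgyz Republic', 'Macao SAR, China', 'Macedonia, FYR',
              'Slovak Republic', 'Trinidad and Tobago'}

valid = known | exceptions
filtered_countries = [c for c in npl['Country Name'] if c in valid]
npl = npl[npl['Country Name'].isin(filtered_countries)]
```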
The wonders of list comprehension
Now that we have the countries we want in our `dataframe`, let’s add some of the other attributes provided in this same World Bank data set. In a separate CSV, they provide the region each country belongs to and the income group of its population. A merge can be done on the country codes (the common column between the files).
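A sketch of the merge, with a hypothetical filename and assuming the metadata file exposes ‘Region’ and ‘IncomeGroup’ columns:

```python
# Hypothetical filename for the accompanying World Bank metadata CSV.
meta = pd.read_csv('country_metadata.csv')
npl = npl.merge(meta[['Country Code', 'Region', 'IncomeGroup']],
                on='Country Code', how='left')
```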
Melting
Great, now we have a list of countries with the relevant information. However, the annual NPL data is still formatted like raw data entries. EDA and further analyses would be easier with a ‘tidy’ data structure: the year columns from 2010 to 2015 would be more useful as values than as column names.
Thus, we are going to ‘melt’ the dataset. This is also known as going from wide format to long format.
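A sketch of the melt, reusing the column names from earlier:

```python
# Wide to long: one row per (country, year) pair.
npl_long = npl.melt(
    id_vars=['Country Name', 'Country Code', 'Region', 'IncomeGroup'],
    value_vars=year_cols,
    var_name='year',
    value_name='npl',
)
```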
Long form data for Argentina
Adding categories
We might have the data we want in a `dataframe`, but does that mean it is ready for analysis? A quick look at `info()` gives a resounding ‘no’.
The ‘year’ column is practically useless as a non-null object for any sort of time series analysis, so it should be converted to either an integer or a `datetime` object.
Furthermore, while we could group the data by income group, it is more telling to attach an ordinal scale. Such a weighting doesn’t apply to regions, which have no natural order, but income groups can be reliably ranked.
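A sketch of both conversions; the income_scale mapping below is a hypothetical weighting, and the labels may need adjusting to match the file:

```python
# Cast year to int for time series work.
npl_long['year'] = npl_long['year'].astype(int)

# Hypothetical ordinal weighting for the income groups.
income_scale = {'Low income': 1, 'Lower middle income': 2,
                'Upper middle income': 3, 'High income': 4}
npl_long['income_rank'] = npl_long['IncomeGroup'].map(income_scale)
```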
Now that the data set is cleaner, let’s look at what the data tells us.
Distribution of NPL
A first look at the distribution of NPL levels shows that it is not quite normal but skewed. Calculated skewness and kurtosis values support this, with both deviating from the ‘normal’ range.
It is important to take note of such skewness and kurtosis values, as they can greatly impact any predictive models.
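A sketch of the checks, using pandas’ built-in skew() and kurt():

```python
import matplotlib.pyplot as plt

# Histogram of NPL levels plus the two shape statistics.
npl_long['npl'].plot.hist(bins=30, title='Distribution of NPL levels (%)')
print('Skewness:', npl_long['npl'].skew())
print('Kurtosis:', npl_long['npl'].kurt())
plt.show()
```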
Scatterplots and identifying clusters
Next, we look at the distribution of these NPL levels over the income groups. The points are then color coded to represent the year.
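One way to sketch that plot, here using seaborn’s stripplot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# NPL levels by income group, color coded by year.
sns.stripplot(x='IncomeGroup', y='npl', hue='year', data=npl_long)
plt.xticks(rotation=45)
plt.show()
```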
I had initially expected NPL to be much higher for lower income groups, so what stood out most was the large group of outliers among the highest-income countries. Who were these bad actors in the debt market?
It seems that the ‘bad actor’ list is dominated by a few countries such as Ireland, Greece, San Marino, and Cyprus. Despite their relatively high income levels, they possess some of the highest NPL levels.
These NPL levels map quite well onto real-life events, with Ireland falling back into recession in 2012, Cyprus having a financial crisis in 2012–2013, San Marino’s financial sector being hit hard by the Eurozone crisis, and Greece facing its massive debt crisis since 2009.
Credit transparency
Could the amount of data we have on each region be an indicator of credit market transparency? The countries in our dataframe are the ones with adequate NPL information over recent years. Let’s compare the relative representation of each region.
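One way to sketch that comparison, reusing the meta dataframe as the denominator (an assumption about how coverage is counted):

```python
# Share of each region's countries that made it into our cleaned data.
in_data = npl['Region'].value_counts()
totals = meta['Region'].value_counts()
coverage = (in_data / totals * 100).round(1).sort_values(ascending=False)
print(coverage)
```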
It seems that the ‘Europe & Central Asia’ and ‘North America’ areas are the most transparent for NPL data. Most other regions hover around the 50% mark whereas ‘Middle East & North Africa’ is a notable laggard in terms of credit transparency with NPL data for only 19% of countries.
Next steps
An initial exploration of NPL data globally yielded some pretty interesting results. However, given that the main purpose of this process was to tidy the data from the World Bank, there is still much room for exploration. I hope to aggregate other lines of data into this `dataframe` and perhaps run a panel data analysis.
It would definitely have been interesting to look into local factors within each region/country to better quantify the credit culture. Credit culture can be a relatively arbitrary term, and most quantitative measures of it are based on repayment history and other loan data. It would be interesting to factor ‘soft’ metrics into the examination of credit culture. These softer metrics could include cultural views on debt, aggressiveness of loan-taking, community trust, and the concept of ego and ‘face’ in borrowing or owing money.
Check out the Jupyter Notebook here if you’re interested in the code!