05.28.2019 | Dr. Georg Walther | comment icon 0 Comment

Use statistical visualization to understand the needs of your customers

In order to better understand your data and come up with useful hypotheses, data visualization is a crucial part of every data science project. It is part of a process referred to as exploratory data analysis or EDA. In this blog article we will show how we can generate insights from data manipulation and visualization. We make use of openly available data provided by Santander Bank for a data science competition held some time ago.

Check out the Santander product recommendation challenge for further details.

In this article we make use of the statistical visualization library seaborn which is freely available as open source package for the programming language Python. The open source programming language Python is widely used by the data science / artificial intelligence community along with a plethora of open source libraries such as seaborn and pandas (we use the latter to load and manipulate the data).

Let us get started by loading the respective libraries:

 

The original Santander data set is fairly big (tens of millions of rows), which does not pose a technical or ressource challenge on present-day hardware (e.g. MacBook Pro with 16 GB RAM). While this amount of data does not pose a fundamental challenge to us it is not amenable to quick explorative data visualization.

To this end we use a sample of the data in the present article:

 

The tabular data provided by Santander encompasses snapshots of the financial products held by each customer (column customer_id) collected over multiple months (date column date_partition). Each financial product is represented by a separate column and whether or not a given product is owned by a customer is indicated by a binary 0 or 1 value:

date_partition customer_id employee_index residence product_home_account
2015-04-28 1155702 N ES 0
2016-02-28 235866 N ES 0
2015-04-28 1080714 N ES 0
2015-07-28 939105 N ES 0
2016-05-28 1248237 N ES 0

 

5 rows × 48 columns

 

Five of the 24 financial products represented in the data are:

Customer count

There are many summary metrics we can look at to start understanding the data. One of these is the development of the number of customers over time.

To this end we count the number of unique customer IDs for each date provided in the Santander data:

 

png

It appears Santander observed a spike of new customers around July 2015.

Income analysis

A key aspect of our customers we may be interested in is their median gross income. The Santander data set provides gross income for a number of customers. We remove those customers from our analysis whose gross income is not provided. We further only consider the gross income of those customers between ages 25 and 50.

 

Here we are interested in broad income trends across the age and residence of our customers. To this end we visualize the median income as a heatmap against both age and providene of residence. To simplify interpretation of this heatmap we focus on those provinces with lower overall median income.

 

png

The heatmap shows a broad trend of increasing median income top to bottom and with increasing age.

This analysis allows us to broadly understand the financial situation of our customers across the country and may guide us in approaching customers in different age groups and provinces with targeted offerings.

Demographic analysis

To drill deeper into our data we plot the age distribution of our customers. Achieving this is simple with the open source tools we use: We merely select the age column of our tabular data, df['age'], and call the distribution plot, distplot, functionality of the seaborn library:

 

png

The age distribution of customers has two modes (peaks): One mode around age 25 and another closer to 50.

To better understand the preferences and needs of our customers we will drill further into our data by splitting our customers into juniors (those below the age of 30) and seniors (starting age 30 and above). We will then compare the distribution of the most popular financial products between these two age groups.

 

png

Age-restricted financial products such as junior accounts (products_junior_account) are understandably heteregeneously distributed between the two age groups.

Not surprisingly, investment-related financial products (product_long_term_deposits, product_funds, and product_securities) are more popular in the more senior age group.

Statistical visualization like this helps us communicate with our customers more productively: We can communicate more frequently with our senior customers about products that are of interest to them. We could further attempt designing new financial products in the investment category that may also appeal to our more junior customers in order to increase our market share in that demographic.

Customer lifetime analysis

A key component of business succsess is how long customers stay with a business (customer lifetime). To analyze the impact of customer lifetime on our business we will now take a look at the correlation between customer lifetime and the number of financial products retained.

To this end we count the number of products held and number of customers in each lifetime bucket (column months_customer in our tabular data):

 

We then divide the number of products held by the number of customers in each lifetime bucket:

 

months_customer no_products no_customers relative_product_count
0 8242.0 13525 0.609390
1 11236.0 13117 0.856598
2 11826.0 12806 0.923473
3 12464.0 13129 0.949349
4 12334.0 12496 0.987036

 

Open-source library seaborn allows us to plot the customer-normalized / relative product count against lifetime buckets while simultaneously fitting a linear model to the data:

 

png

Somewhat unsurprisingly the above plot shows a general increase in the number of products retained by a customer with increasing customer lifetime.

Interestingly, the linear model we fit to this trend roughly indicates that customers who stay with us for an additional 100 months (about eight years) will purchase one additional financial product.

We should now drill further into our data to understand the additional revenue generated on average by each additional financial product purchased. Combining this with an in-depth churn model (a predictive model that tells us which customers are most likely to leave) allows us to efficiently use our ressources to retain those customers most likely to stay with us for at least another eight years thus likely generating additional revenue.

Product popularity and customer lifetime

We now have a rough understanding of the benefits of retaining customers – but what products should we recommend to our customers? Our above analysis showed clear trends in product preference in relation to customer age.

Here we will look at product preferences related to customer lifetime. To this end we will count the per-product customers and compute the relative product retention per lifetime bucket.

 

months_customer variable value no_customers relative_product_count
0 product_current_account 7576.0 13525 0.560148
0 product_payroll_account 55.0 13525 0.004067
0 product_junior_account 43.0 13525 0.003179
0 product_mas_particular_account 239.0 13525 0.017671
0 product_particular_account 0.0 13525 0.000000

 

png

In this array of plots we can see that customer interest in current accounts (product_current_account) plateaus with customer lifetime.

A number of other financial products (product_taxes, product_credit_card, etc.) show a steady increase in customer interest across their lifetime.

Interestingly, long term deposits (product_long_term_deposits) experience a sharp increase in interest around the 50 month customer lifetime bucket.

Well-placed communication about products with customers who show the greatest interest will likely help retain these customers and prolong their lifetime with us.

data science machine learning visualization
Dr. Georg Walther

Dr. Georg Walther

Georg holds degrees in biochemistry and computational biology from ETH Zurich, completed research at UBC Vancouver and holds a PhD degree from John Innes Centre. He has vast industry experience in the application of data science and machine learning methodologies.

Show all posts by Dr. Georg Walther

Leave a Comment