Customer Supermarket Analysis
Hi, Faisal here! Today I'm going to talk about customer supermarket analysis: analyzing customer behavior, such as customer demographics, which products are purchased most often, and where customers buy them, whether directly in-store or online. One purpose of this analysis is to make the supermarket's next campaign more effective.
Table of contents:
- Introduction
- Problem Statement
- Data Preparation
1. Dataset overview
2. Handling anomaly data
3. Feature engineering
- Data Analysis
1. Is the campaign effective enough?
2. How do customer demographics influence campaigns?
3. From the segmentation above, what products are most often purchased by customers?
4. In which places do customers make a lot of purchases?
- Conclusion
- Recommendation
Introduction
In this article, I walk through a data analysis I did on customer supermarket data. The dataset can be accessed through this link!
The company has run a campaign/promotion and has asked the data analyst team to analyze it further. This analysis will then inform the company's decisions about the next campaign.
Problem Statement
Based on the campaigns that have been carried out, the company wants to know which customer segments accept campaigns and which do not. This will help the company maximize its benefit, or in other words, choose a strategy that makes the next campaign effective.
To achieve this, we as the data analyst team will try to answer the following questions:
1. Is the campaign effective enough?
2. How do customer demographics influence campaigns?
3. From the segmentation above, what products are most often purchased by customers?
4. In which places do customers make a lot of purchases?
Dataset Overview (Data Preparation)
The data contains 29 columns and 2,240 rows. The columns are divided into 4 categories: people, products, promotion, and place. People contains customer demographic data, products contains the amount of money each customer spent on each product, promotion contains each customer's campaign responses, and place contains the number of purchases each customer made through each channel.
Handling Anomaly Data (Data Preparation)
There is an ID of 0 in the ID column. In some programming languages, a 0 (zero) can be mistaken for a null value, so to avoid this I handled the 0 ID by replacing it with a random number between the minimum and maximum ID, generated with NumPy's random module.
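As a rough sketch of this fix (assuming pandas and NumPy, with a toy ID column rather than the real data), the replacement could look like this:

```python
import numpy as np
import pandas as pd

# Toy data: one invalid 0 ID among valid IDs
df = pd.DataFrame({'ID': [0, 1825, 5324, 7446]})

rng = np.random.default_rng(42)
# Draw a replacement between the min and max of the valid (non-zero) IDs,
# retrying if the drawn number is already taken.
valid = df.loc[df['ID'] != 0, 'ID']
new_id = rng.integers(valid.min(), valid.max())
while new_id in set(valid):
    new_id = rng.integers(valid.min(), valid.max())

df.loc[df['ID'] == 0, 'ID'] = new_id
```

The retry loop is an extra safeguard so the generated ID does not collide with an existing one.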
Then, there are missing values in the income column: 24 rows, or about 1% of the data. Because the dataset is limited, rather than deleting these rows I imputed the missing values. Step by step, handling the missing values in the income column:
1. First, split the data into 2 DataFrames: the first contains customers with no missing income values, the second contains customers with missing income values.
2. Based on the phik correlation, the income column is significantly correlated with the education and year_birth columns, so the missing income values are filled based on those two columns.
3. The first DataFrame is further divided into several ranges based on quantiles to make the fill fairer.
4. Then, check the distribution of each range, using the mean/median as the measure of central tendency and the standard deviation/median absolute deviation as the measure of spread.
5. Match the rows between the NaN and non-NaN income DataFrames, then fill the NaN incomes based on the non-NaN incomes.
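The full matching logic is more involved; below is a simplified, hypothetical sketch of the idea, imputing missing Income values with the median of customers sharing the same Education level (the actual analysis also conditions on year_birth quantile ranges, and the column names assume the marketing-campaign dataset):

```python
import numpy as np
import pandas as pd

# Toy data: one missing Income value among PhD customers
df = pd.DataFrame({
    'Education': ['PhD', 'PhD', 'Master', 'Master', 'PhD'],
    'Income':    [80000, np.nan, 50000, 52000, 82000],
})

# Fill each missing Income with the median of its Education group
df['Income'] = df.groupby('Education')['Income'] \
                 .transform(lambda s: s.fillna(s.median()))
```

Group-wise imputation keeps the fill consistent with the columns that actually correlate with income, instead of using one global value.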
Furthermore, there are 3 outliers in the year_birth column; they are handled by replacing the year_birth value with the mode. Step by step, handling the outliers in the year_birth column:
1. Split the data into 4 DataFrames based on quartiles.
2. Find the mode in each quartile.
3. Then, determine which quartile range each outlier falls into.
4. Replace the outlier with the mode found in step (2).
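The steps above could be sketched as follows, using a toy example with one implausible birth year; the use of pd.qcut and the exact outlier rule (birth year before 1900) are my assumptions about the implementation:

```python
import pandas as pd

# Toy data: 1893 is an implausible birth year (outlier)
df = pd.DataFrame({'Year_Birth': [1893, 1950, 1950, 1965, 1972, 1980, 1980, 1990]})

# Step 1: bin the values into quartiles
df['bin'] = pd.qcut(df['Year_Birth'], q=4, labels=False, duplicates='drop')

# Step 2: find the mode of each quartile bin
modes = df.groupby('bin')['Year_Birth'].agg(lambda s: s.mode().iloc[0])

# Steps 3-4: locate each outlier's bin and replace it with that bin's mode
outlier = df['Year_Birth'] < 1900
df.loc[outlier, 'Year_Birth'] = df.loc[outlier, 'bin'].map(modes)
df = df.drop(columns='bin')
```

Replacing with the mode of the outlier's own quantile range keeps the imputed value close to the distribution it came from.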
Feature Engineering (Data Preparation)
1. People Columns
In the people columns, I added an age column, computed by subtracting the customer's year of birth from the year the customer joined as a member. The age is then discretized using domain knowledge. Next, I added an income_category column based on the income column, using the median as the measure of central tendency and the MAD as the measure of spread. After that, I added a num_children column, obtained by adding the Kidhome and Teenhome columns, and discretized it into a has_child column (whether the customer has children or not).
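A minimal sketch of these people features, assuming the dataset's Year_Birth, Dt_Customer, Kidhome, and Teenhome columns with toy values (the date format is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    'Year_Birth': [1957, 1985, 1990],
    'Dt_Customer': ['04-09-2012', '08-03-2014', '21-08-2013'],
    'Kidhome':    [0, 1, 0],
    'Teenhome':   [1, 1, 0],
})

# Age at the time the customer joined as a member
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')
df['age'] = df['Dt_Customer'].dt.year - df['Year_Birth']

# Total children, then a binary has_child flag
df['num_children'] = df['Kidhome'] + df['Teenhome']
df['has_child'] = (df['num_children'] > 0).map({True: 'yes', False: 'no'})
```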
2. Products Columns
In the products columns, I added a total amount column, obtained by summing the amount each customer spent across all products. It is then discretized into a spent_category column (high_spent vs. low-medium_spent) using the median as the central tendency and the MAD as the spread.
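A sketch of the total amount and spent_category features; for brevity only two product columns are used, and the exact cut-off rule (here median + MAD) is my assumption about how the median/MAD split is applied:

```python
import pandas as pd

# Toy data: spend per product category per customer
df = pd.DataFrame({
    'MntWines':        [600, 20, 400, 50],
    'MntMeatProducts': [450, 10, 300, 40],
})

# Total amount spent per customer
df['total_amount'] = df[['MntWines', 'MntMeatProducts']].sum(axis=1)

# Split spenders using the median as center and the MAD as spread
median = df['total_amount'].median()
mad = (df['total_amount'] - median).abs().median()
threshold = median + mad
df['spent_category'] = df['total_amount'].apply(
    lambda x: 'high_spent' if x > threshold else 'low-medium_spent'
)
```

The median/MAD pair is robust to the heavy right skew typical of spending data, unlike the mean and standard deviation.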
3. Promotion Columns
Next, in the promotion columns, I added an accept_campaign column containing accept or never: accept means the customer accepted at least one campaign, while never means the customer never accepted any campaign.
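A sketch of the accept_campaign flag, assuming the dataset's AcceptedCmp* and Response campaign-response columns (only a subset shown here):

```python
import pandas as pd

# Toy data: 0/1 flags for each campaign response
df = pd.DataFrame({
    'AcceptedCmp1': [0, 0, 1],
    'AcceptedCmp2': [0, 0, 0],
    'Response':     [0, 1, 0],
})

# 'accept' if the customer responded to at least one campaign, else 'never'
campaign_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'Response']
df['accept_campaign'] = (df[campaign_cols].sum(axis=1) > 0) \
    .map({True: 'accept', False: 'never'})
```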
Data Analysis
1. Is the campaign effective enough?
To answer this question, I conducted a two-sample proportions z-test. The test checks whether the proportion of high_spent customers is the same among those who accepted the campaign and those who never did, or whether it is greater among those who accepted. If it is greater, the campaign can be considered effective enough.
Assumption: whether a campaign is effective or not can be seen from the amount spent by the customer.
- H0: the proportion of high-spent customers among those who accepted the campaign = the proportion among those who NEVER accepted it
- Ha: the proportion of high-spent customers among those who accepted the campaign > the proportion among those who NEVER accepted it
Significance level (alpha) = 0.05
# number of customers who accepted the campaign and are high spent
n_high_accept = df_pivot.loc['accept', 'high spent']
# number of customers who never accepted the campaign and are high spent
n_high_never = df_pivot.loc['never', 'high spent']
# total number of customers in the accept group
n_accept = df_pivot.sum(axis=1)['accept']
# total number of customers in the never group
n_never = df_pivot.sum(axis=1)['never']
# perform the statistical test
from statsmodels.stats.proportion import proportions_ztest
stat, pvalue = proportions_ztest(
    count=[n_high_accept, n_high_never],  # high-spent counts in the accept and never groups
    nobs=[n_accept, n_never],             # total sizes of the accept and never groups
    alternative='larger'                  # Ha: the accept proportion is larger
)
# note: with alternative='larger', proportions_ztest already returns a
# one-sided p-value, so it does not need to be halved
# conclusion
if pvalue > 0.05:
    print(f'''
    Pvalue = {pvalue}. Pvalue > 0.05. Fail to reject H0.
    We do NOT have enough evidence to say that the campaign is effective
    (proportion of high-spent customers who accepted the campaign = proportion who NEVER accepted it).
    (The proportions are NOT significantly different.)
    ''')
else:
    print(f'''
    Pvalue = {pvalue}. Pvalue <= 0.05. Reject H0 in favor of Ha.
    We have enough evidence to say that the campaign is effective
    (proportion of high-spent customers who accepted the campaign > proportion who NEVER accepted it).
    (The proportions are significantly different.)
    ''')
Output:
Pvalue = 1.1890009998712533e-47. Pvalue <= 0.05. Reject H0 in favor of Ha.
We have enough evidence to say that the campaign is effective (proportion of high-spent customers who accepted the campaign > proportion who NEVER accepted it).
(The proportions are significantly different.)
Because the p-value is far below 0.05, I have enough evidence to conclude that the proportion of high-spent customers is greater among those who accepted the campaign than among those who never did; in other words, the campaign is quite effective.
2. How do customer demographics affect who accepts campaigns?
- Age Category
Insight can be drawn that teenagers (13 through 17 years old) accept campaigns at a higher rate than other age categories. However, because data on the teenager group is limited, for further analysis I focused on the old adults group (older than 59 years old), whose acceptance rate is 30.29%.
Based on the outlier significance graph, the old adults group accepted the 5th campaign most often.
- Marital Status Category
Insight can be drawn that widows accept campaigns at a higher rate than other marital status categories, at 35.06%.
Based on the outlier significance graph, widows accepted the last campaign most often.
- Has Child/Not Category
Insight can be drawn that customers without children accept campaigns at a higher rate than customers with children, at 44.67%.
Based on the outlier significance graph, customers without children accepted the 5th campaign most often.
- Education Category
Insight can be drawn that customers with a PhD accept campaigns at a higher rate than other education levels, at 32.51%.
Based on the outlier significance graph, PhDs accepted the last campaign most often.
- Income Category
In the income category, the high income group accepts the most campaigns, at 48.34%.
Based on the outlier significance graph, the high income group accepted the 5th campaign most often.
3. From that segmentation, what products are most often purchased by customers?
Across all categories, the 2 product types that generate the most revenue are Wine and Meat.
- For old adults, their average spend on Wine products is 513.16 to 696.72 and on Meat products 162.82 to 268.39.
- For customers with a PhD and high income, their average spend on Wine products is 800.94 to 978.04 and on Meat products 353.93 to 472.11.
- For widowed customers without children, their average spend on Wine products is 404.54 to 827.46 and on Meat products 342.04 to 605.1.
4. In which places do customers make a lot of purchases?
The highest number of purchases made by customers is through the store, with 4,026 purchases.
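For reference, a sketch of how the per-place totals can be compared, assuming the dataset's NumWebPurchases, NumCatalogPurchases, and NumStorePurchases columns with toy values (not the real totals):

```python
import pandas as pd

# Toy data: purchase counts per channel per customer
df = pd.DataFrame({
    'NumWebPurchases':     [4, 2, 6],
    'NumCatalogPurchases': [1, 0, 3],
    'NumStorePurchases':   [8, 5, 10],
})

# Sum each channel's purchases across customers and find the busiest one
totals = df[['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']].sum()
top_place = totals.idxmax()
```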
Conclusion
- Judging from the revenue the company earns from customer purchases of its products, the campaign can be said to be effective.
- Based on age category, education level, marital status, having children or not, and income category, customers who accept campaigns can be divided into 3 segmentation groups:
a. Group 1, in terms of income category and education level → customers who accept the campaign have high income and a PhD education level.
b. Group 2, in terms of marital status and children → customers who accept the campaign are widows without children.
c. Group 3, in terms of age category → customers who accept the campaign are old adults (over 59 years old).
- The products that bring in the most money for the company are Wine and Meat.
- Customers who accept the campaign make the most purchases directly in the store, with 4,026 purchases.
Recommendation
If the company wants to do another campaign, then the campaign can be prioritized for customers:
- who are over 59 years old,
- high income,
- PhD education level,
- widow marital status (or single), and
- have no children.
Ideally, the company should run the campaign at least 5 times, because based on the analysis, most customers accepted the 5th or last campaign. In terms of products, the company can run promotions for Wine and Meat, since these are the products customers purchase most often. Furthermore, campaigns can be carried out directly in the store.
For the complete code and analysis, you can visit my repository here on my GitHub profile! Or if you want to see my data visualizations in Tableau, click here!
It's really nice to see you reading my article to the end, so thank you for that, and I hope you're doing great! See you in the next one!!!