cft

Data Analyst Project: Business Decision Research - DQLab

The analysis of customer churn data in a fictional shop called DQLab sport center using Python.


user

Syiva M F

a year ago | 8 min read

Data Analytics Test

This data analytics test contains 2 parts, theory and test coding, each of which consists of:

  1. TheoryBasic Data Analytics Concepts: This test is intended to test members' understanding of data analytics.
  2. Coding TestData preparation test: This test is intended to test the member's ability to perform ETL data.Data visualization test: This test is intended to test the ability of members in terms of data visualization.Basic Stats Method test: This test is intended to test the member's ability to model data using basic statistics.

Market Research and Recommendation and Visualization Technique for Business Decision Making

DQLab sports center is a shop that sells various sports needs such as Jackets, Clothes, Bags, and Shoes. This shop started selling since 2013, so it has had regular customers for a long time, and is still trying to get new customers to this day.

In early 2019, the store manager recruited a junior DA to help solve the problem at his store, namely the decline in customers returning to his store. The Junior DA was also entrusted with processing the store's transaction data. The store manager defines that a customer is no longer called a customer (churn) when he has not transacted to his store again up to the last 6 months from the last available data update.

The store manager also provided transaction data from 2013 to 2019 in the form of csv (comma separated value) with data_retail.csv with 100,000 rows of data.

Here's what the data looks like:

The fields contained in the data include:

  1. No
  2. Row_Num
  3. Customer_ID
  4. Product
  5. First_Transaction
  6. Last_Transaction
  7. Average_Transaction_Amount
  8. Count_Transaction

Data analysis steps :

  1. Data preparation testImporting data: Importing data_retail.csv into the python environment.Cleansing data: Performs cleaning and modification of data so that it is ready to be used for further analysis.
  2. Data visualization test: Gain insight from the visualization results that have been created.
  3. Basic stats method test: Gain insight from the model and evaluate the model that has been created and tested.

Importing Data and Inspection

Import the dataset from https://storage.googleapis.com/dqlab-dataset/data_retail.csv

Then inspect the dataset with :

  1. print only the top five data
  2. print dataset info

Here's the dataset opened using the Text Editor:

Code :

import pandas as pd

df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/data_retail.csv',sep=';')

print('Top five data:')

print(df.head())

print('\nInfo dataset:')

print(df.info())

Output :

Top five data:

no Row_Num ... Average_Transaction_Amount Count_Transaction

0 1 1 ... 1467681 22

1 2 2 ... 1269337 41

2 3 3 ... 310915 30

3 4 4 ... 722632 27

4 5 5 ... 1775036 25

[5 rows x 8 columns]

Info dataset:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 100000 entries, 0 to 99999

Data columns (total 8 columns):

no 100000 non-null int64

Row_Num 100000 non-null int64

Customer_ID 100000 non-null int64

Product 100000 non-null object

First_Transaction 100000 non-null int64

Last_Transaction 100000 non-null int64

Average_Transaction_Amount 100000 non-null int64

Count_Transaction 100000 non-null int64

dtypes: int64(7), object(1)

memory usage: 6.1+ MB

None

Data Cleansing

The two columns that indicate the occurrence of the transaction are not of type datetime, so change them to the datetime data type. Then reprint the top 5 data from the df dataframe and also the data type of each column.

Code :

# First_Transaction

df['First_Transaction'] = pd.to_datetime(df['First_Transaction']/1000, unit='s', origin='1970-01-01')

# Last_Transaction

df['Last_Transaction'] = pd.to_datetime(df['Last_Transaction']/1000, unit='s', origin='1970-01-01')

print('Top five data:')

print(df.head())

print('\nInfo dataset:')

print(df.info())

Output :

Top five data:

no Row_Num ... Average_Transaction_Amount Count_Transaction

0 1 1 ... 1467681 22

1 2 2 ... 1269337 41

2 3 3 ... 310915 30

3 4 4 ... 722632 27

4 5 5 ... 1775036 25

[5 rows x 8 columns]

Info dataset:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 100000 entries, 0 to 99999

Data columns (total 8 columns):

no 100000 non-null int64

Row_Num 100000 non-null int64

Customer_ID 100000 non-null int64

Product 100000 non-null object

First_Transaction 100000 non-null datetime64[ns]

Last_Transaction 100000 non-null datetime64[ns]

Average_Transaction_Amount 100000 non-null int64

Count_Transaction 100000 non-null int64

dtypes: datetime64[ns](2), int64(5), object(1)

memory usage: 6.1+ MB

None

Churn Customers

To determine churn customers according to the given definition, look for :

  1. When was the most recent transaction made?
  2. Classify which customers are churn statuses and which are not

Code :

# Checking the last transaction in the dataset

print(max(df['Last_Transaction']))

# Classify customers who have churn status or not with boolean

df.loc[df['Last_Transaction'] <= '2018-08-01', 'is_churn'] = True

df.loc[df['Last_Transaction'] > '2018-08-01', 'is_churn'] = False

print('Top five data:')

print(df.head())

Output :

Top five data:

no Row_Num ... Count_Transaction is_churn

0 1 1 ... 22 False

1 2 2 ... 41 False

2 3 3 ... 30 False

3 4 4 ... 27 False

4 5 5 ... 25 False

[5 rows x 9 columns]

Delete unnecessary columns

Next, we will delete the unnecessary columns i.e. 'no' and 'Row_Num' columns.

Code :

del df['no']

del df['Row_Num']

print(df.head())

Output :

Customer_ID Product ... Average_Transaction_Amount Count_Transaction

0 29531 Jaket ... 1467681 22

1 29531 Sepatu ... 1269337 41

2 141526 Tas ... 310915 30

3 141526 Jaket ... 722632 27

4 37545 Sepatu ... 1775036 25

[5 rows x 6 columns]

Customer acquisition by year

The next step is to create a data visualization in the form of a trend of customer acquisition by year using a bar chart. Create an additional feature/column which is the year of First_Transaction and the year of Last_Transaction respectively named Year_First_Transaction and Year_Last_Transaction before visualizing.

Code :

df['Year_First_Transaction'] = df['First_Transaction'].dt.year

df['Year_Last_Transaction'] = df['Last_Transaction'].dt.year

import matplotlib.pyplot as plt

df_year = df.groupby(['Year_First_Transaction'])['Customer_ID'].count()

df_year.plot(x='Year_First_Transaction', y='Customer_ID', kind='bar', title='Graph of Customer Acquisition')

plt.xlabel('Year_First_Transaction')

plt.ylabel('Num_of_Customer')

plt.tight_layout()

plt.show()

Output :

Transactions by year

Visualize the trend of the number of transactions per year using a bar chart.

Code :

import matplotlib.pyplot as plt

plt.clf()

df_year = df.groupby(['Year_First_Transaction'])['Count_Transaction'].count()

df_year.plot(x='Year_First_Transaction', y='Count_Transaction', kind='bar', title='Graph of Transaction Customer')

plt.xlabel('Year_First_Transaction')

plt.ylabel('Num_of_Transaction')

plt.tight_layout()

plt.show()

Output :

Average transaction amount by year

Using the seaborn pointplot, visualize the trend from year to year in the average number of transactions for each product.

Code :

import matplotlib.pyplot as plt

import seaborn as sns

plt.clf()

sns.pointplot(data = df.groupby(['Product', 'Year_First_Transaction']).mean().reset_index(),

x='Year_First_Transaction',

y='Average_Transaction_Amount',

hue='Product')

plt.tight_layout()

plt.show()

Output:

Upvote


user
Created by

Syiva M F

I love the idea of being able to make better data which can be used for our future. Through development of personal projects and a valuable experiences, I have learned in methodologies and processes for managing large-scale databases, it makes me working comfortable with numbers and data. Extremely motivated to develop my skills and grow professionaly as a Data Analyst.


people
Post

Upvote

Downvote

Comment

Bookmark

Share


Related Articles