Data Analytics Test

This data analytics test contains 2 parts, theory and test coding, each of which consists of:

TheoryBasic Data Analytics Concepts: This test is intended to test members' understanding of data analytics.
Coding TestData preparation test: This test is intended to test the member's ability to perform ETL data.Data visualization test: This test is intended to test the ability of members in terms of data visualization.Basic Stats Method test: This test is intended to test the member's ability to model data using basic statistics.

Market Research and Recommendation and Visualization Technique for Business Decision Making

DQLab sports center is a shop that sells various sports needs such as Jackets, Clothes, Bags, and Shoes. This shop started selling since 2013, so it has had regular customers for a long time, and is still trying to get new customers to this day.

In early 2019, the store manager recruited a junior DA to help solve the problem at his store, namely the decline in customers returning to his store. The Junior DA was also entrusted with processing the store's transaction data. The store manager defines that a customer is no longer called a customer (churn) when he has not transacted to his store again up to the last 6 months from the last available data update.

The store manager also provided transaction data from 2013 to 2019 in the form of csv (comma separated value) with data_retail.csv with 100,000 rows of data.

Here's what the data looks like:

The fields contained in the data include:

No
Row_Num
Customer_ID
Product
First_Transaction
Last_Transaction
Average_Transaction_Amount
Count_Transaction

Data analysis steps :

Data preparation testImporting data: Importing data_retail.csv into the python environment.Cleansing data: Performs cleaning and modification of data so that it is ready to be used for further analysis.
Data visualization test: Gain insight from the visualization results that have been created.
Basic stats method test: Gain insight from the model and evaluate the model that has been created and tested.

Importing Data and Inspection

Import the dataset from https://storage.googleapis.com/dqlab-dataset/data_retail.csv

Then inspect the dataset with :

print only the top five data
print dataset info

Here's the dataset opened using the Text Editor:

Code :


import pandas as pd
df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/data_retail.csv',sep=';')
print('Top five data:')
print(df.head())
print('\nInfo dataset:')
print(df.info())

Output :


Top five data:
   no  Row_Num  ...  Average_Transaction_Amount Count_Transaction
0   1        1  ...                     1467681                22
1   2        2  ...                     1269337                41
2   3        3  ...                      310915                30
3   4        4  ...                      722632                27
4   5        5  ...                     1775036                25
[5 rows x 8 columns]
Info dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
no                            100000 non-null int64
Row_Num                       100000 non-null int64
Customer_ID                   100000 non-null int64
Product                       100000 non-null object
First_Transaction             100000 non-null int64
Last_Transaction              100000 non-null int64
Average_Transaction_Amount    100000 non-null int64
Count_Transaction             100000 non-null int64
dtypes: int64(7), object(1)
memory usage: 6.1+ MB
None

Data Cleansing

The two columns that indicate the occurrence of the transaction are not of type datetime, so change them to the datetime data type. Then reprint the top 5 data from the df dataframe and also the data type of each column.

Code :


# First_Transaction
df['First_Transaction'] = pd.to_datetime(df['First_Transaction']/1000, unit='s', origin='1970-01-01')
# Last_Transaction
df['Last_Transaction'] = pd.to_datetime(df['Last_Transaction']/1000, unit='s', origin='1970-01-01')
print('Top five data:')
print(df.head())
print('\nInfo dataset:')
print(df.info())

Output :


Top five data:
   no  Row_Num  ...  Average_Transaction_Amount Count_Transaction
0   1        1  ...                     1467681                22
1   2        2  ...                     1269337                41
2   3        3  ...                      310915                30
3   4        4  ...                      722632                27
4   5        5  ...                     1775036                25
[5 rows x 8 columns]
Info dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
no                            100000 non-null int64
Row_Num                       100000 non-null int64
Customer_ID                   100000 non-null int64
Product                       100000 non-null object
First_Transaction             100000 non-null datetime64[ns]
Last_Transaction              100000 non-null datetime64[ns]
Average_Transaction_Amount    100000 non-null int64
Count_Transaction             100000 non-null int64
dtypes: datetime64[ns](2), int64(5), object(1)
memory usage: 6.1+ MB
None

Churn Customers

To determine churn customers according to the given definition, look for :

When was the most recent transaction made?
Classify which customers are churn statuses and which are not

Code :


# Checking the last transaction in the dataset
print(max(df['Last_Transaction']))
# Classify customers who have churn status or not with boolean
df.loc[df['Last_Transaction'] <= '2018-08-01', 'is_churn'] = True
df.loc[df['Last_Transaction'] > '2018-08-01', 'is_churn'] = False
print('Top five data:')
print(df.head())

Output :


Top five data:
   no  Row_Num  ...  Count_Transaction is_churn
0   1        1  ...                 22    False
1   2        2  ...                 41    False
2   3        3  ...                 30    False
3   4        4  ...                 27    False
4   5        5  ...                 25    False
[5 rows x 9 columns]

Delete unnecessary columns

Next, we will delete the unnecessary columns i.e. 'no' and 'Row_Num' columns.

Code :


del df['no']
del df['Row_Num']
print(df.head())

Output :


   Customer_ID Product  ... Average_Transaction_Amount Count_Transaction
0        29531   Jaket  ...                    1467681                22
1        29531  Sepatu  ...                    1269337                41
2       141526     Tas  ...                     310915                30
3       141526   Jaket  ...                     722632                27
4        37545  Sepatu  ...                    1775036                25
[5 rows x 6 columns]

Customer acquisition by year

The next step is to create a data visualization in the form of a trend of customer acquisition by year using a bar chart. Create an additional feature/column which is the year of First_Transaction and the year of Last_Transaction respectively named Year_First_Transaction and Year_Last_Transaction before visualizing.

Code :


df['Year_First_Transaction'] = df['First_Transaction'].dt.year
df['Year_Last_Transaction'] = df['Last_Transaction'].dt.year
import matplotlib.pyplot as plt
df_year = df.groupby(['Year_First_Transaction'])['Customer_ID'].count()
df_year.plot(x='Year_First_Transaction', y='Customer_ID', kind='bar', title='Graph of Customer Acquisition')
plt.xlabel('Year_First_Transaction')
plt.ylabel('Num_of_Customer')
plt.tight_layout()
plt.show()

Output :

Transactions by year

Visualize the trend of the number of transactions per year using a bar chart.

Code :


import matplotlib.pyplot as plt
plt.clf()
df_year = df.groupby(['Year_First_Transaction'])['Count_Transaction'].count()
df_year.plot(x='Year_First_Transaction', y='Count_Transaction', kind='bar', title='Graph of Transaction Customer')
plt.xlabel('Year_First_Transaction')
plt.ylabel('Num_of_Transaction')
plt.tight_layout()
plt.show()

Output :

Average transaction amount by year

Using the seaborn pointplot, visualize the trend from year to year in the average number of transactions for each product.

Code :


import matplotlib.pyplot as plt
import seaborn as sns
plt.clf()
sns.pointplot(data = df.groupby(['Product', 'Year_First_Transaction']).mean().reset_index(),
			  x='Year_First_Transaction',
			  y='Average_Transaction_Amount',
			  hue='Product')
plt.tight_layout()
plt.show()

Output: