Exploratory Data Analysis (EDA) defines the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of statistical summary and graphical representations.


From a retail perspective, it is the first step to understand the nature of a business. It also gives a telescopic view of vital signs which helps to define a problem statement, reason, and the procedure to solve a specific problem.

Overall it has three phases, follow the sequence.

  1. Data visualization: Visualization of data that should be simple to read.
  2. Define data: Description of data and define characteristics of business if not already defined like Lapsation period, Retention rate, Urban/Rural, and more.
  3. Conclusions: See what is going well and what can be improved.

Exploratory data analysis flow


  1. Initial checks and defining time period of analysis: It is not possible to conclude the health of a business by just analyzing one cycle of the business or by just analyzing historical data. Hence, before starting with any analysis, it is very important to define the time period of analysis. The time period of analysis should be relevant to the current day and should be long enough to analyze the trend. Ideally, it should be the date of analysis – 2*Lapsation period.
  2. Preparing dataset for analysis: The Exploratory Data Analysis(EDA) looks into all dimensions of data. The most practical way is to prepare different datasets for all dimensions and use them separately.
    The following are the examples.
    1. Transactional data
    2. Customer single view
    3. Demography data
    4. Loyalty data

The following is the flow of exploratory data analysis.

  1. Top Line view or Overall performance snapshot: The Exploratory Data Analysis(EDA) should always start with a performance snapshot that is clear and simple showing all key performance indicators performance in a given time period (customer's performance compared to last year).
    The following are the KPIs to be shown. 
    • Total sales
    • Customer base (as to date)
    • Loyalty sales
    • %Loyalty sales
    • Repeat sales
    • Enrolments
    • Enrolment efficiency
    • Customer transacted
    • ABS
    • ATV
  2. Customer Snapshot: This gives a complete view of all existing customers of an org/brand and categorizes customers in different buckets like One timer and Repeaters(when the customer visits the store more than once). It captures the snapshot of customers' Demographics.
  3. Define lapsation period of the org/brand: Different nature of business has different customer latency. Hence we need to define the Lapsation period for the brand we are analyzing as it helps in categorizing customers. We just need to see what cumulative % of repeaters repeats in what time interval. Usually, the threshold is 80%.
  4. Org/Brand's retention and drop rate: Check the stickiness of the org/brand. You can check it by looking at the drop rate curve. These are the vital signs of a business that shows the customer retention with an org/brand (based on customer's visit).
  5. Loyalty revenue trend and drivers: There are only a few drivers of loyalty sales hence it is important to have a grip on those drivers. Here you can see the trend of loyalty revenue and how it is affected by its drivers (positively and negatively) so that you can concentrate on a particular driver to tweak the revenue positively. This section includes a lot of data descriptions and explains what went well and what did not. This should be followed by a trend of drivers and its description which needed to be in sync with the previous description.
    Loyalty sales = Customer shopped * ATV * Bill per customer
    ATV = ABS * APP
  6. Product analysis: This analysis shows what, where, and when products are sold. It also helps orgs/brands to stock products in a strategic manner. You can see the distribution of sold products and their contribution to sales in product categories and subcategories (different dimensions of the product).
  7. Regional/zonal analysis: Check if the characteristics of business change based on geographic location. This analysis lets you understand what works well in which region, helping to create the regional or zonal business strategy. This analysis also includes a comparison of KPIs YoY(year over year), zone to zone, and sold product distribution in different regions or zones.
  8. Loyalty overview: This overview shows the top view of the existing loyalty program of an org/brand.
    Points to look at - 
    1.  Earn to burn Rate,  
    2. Redemption rate, and 
    3. Redemption sale per point redeemed.
  9. Campaign overview: This shows the overview of marketing and loyalty campaigns conducted by org/brand. Loyalty overview helps orgs/brands to review their campaign strategy and validate marketing or CRM spends.
    Points to look at - 
    1. CRM impact (incremental sale),
    2. ROI multiple (Redemption sale or campaign spend),
    3. What works (top five campaigns based on hit rate), and
    4. What does not work (bottom five campaigns).
  10. Conclusion and things to work on (Pain Points): Based on analysis and industry-standard, define the problem areas of the org/brand in pointers. Also, define the strong points of the org/brand and what works for the org/brand. positively, and how to maintain the same.
  11. Way forward: Based on pain points mention the next action plan to fix the issues.

Steps to run Notebook

The Databricks notebook has different commands where codes are written. The Cmd{number} represents that command line that you need to refer to.

Prerequisite to run the notebook

  1. Clone the Notebook into your workspace.
  2. Open your cluster-specific link provided for the Notebook.
    NotebooksCluster links
    Exploratory data analysisIndia  SEA
    Exploratory data analysis - OU level enhancementIndia  SEA

To run the notebook, follow the steps.

  1. Cmd 1: Import all the required python libraries.
  2. Cmd 2: Read instructions on how to use the notebook.
  3. Input the following data into the text box:
    1. Org_id, Start date, End date, Active Period (obtained from the lapsation curve), Lapsation Period (according to the org/brand).
    2. For product analysis:
      Product Database name, Product table name, Product item code column name (used to join with bill_lineitems), Product Category Column name.
    3. For campaign analysis:
      Input Campaign start date (should not be more than one year from the said start date).
  4. Cmd 5-11: Importing datasets for analysis, the data import is done and stored in pandas datasets which can be used for visualization and insight generation.
    1. Cmd 5: All transactional level data for the last three years is imported and stored in a dataset named summary_3_years.
    2. Cmd 6-7: These are optional and should only be used if you intend to use demographic data in your analysis.
    3. Cmd 8 and 9: Importing YoY data for all relative KPIs which can be compared and examined later.
    4. Cmd 10: Creating a single customer where you have all the details of a customer in a single row.
    5. Cmd 11: Importing Month on month customer register with the org/brand.
  5. Cmd 13-15: Preparing data for Waterfall Chart
    1. Cmd 13 – Importing data to be presented in waterfall chart (last two calendar years), change years if you want to compare the current year with last year.
    2. Cmd 14-15: Data manipulation (Do not change).
  6. Cmd  17-18: Preparing topline number data (YoY comparison).
  7. Cmd 21: Set the agenda of the dashboard with the help of markup language (%md).
  8. Cmd 23-24:  Visualizing YoY Data, (do not change until you know how to use Matplotlib and seaborn packages)
  9. Cmd 25: Visualizing waterfall chart.
  10. Cmd 26: Provide observations from waterfall chart with the help of markup language.
  11. Cmd 27-29: Orgs/Brands top-line view, Data preparation, Displaying brand topline view in tabular format and commentary on same.
  12. Cmd 30-34: Loyalty vs non-loyalty, Data manipulation and visualization of YoY data in terms of loyalty and non-loyalty.
  13. Cmd 41-45: Define the lapsation period of org/brand.
    1. Cmd 42-43: Data manipulation
    2. Cmd 44: Visualizing the lapsation curve, provide the lapsation period as per curve (Point of latency where %customer reaches 80%).  Read comments on the command cell.
    3. Cmd 45: Define the lapsation period of brand and set rules for active, lapsed, and lost status.
  14. Cmd 46-49: Checking org/brand retention
    1. Cmd 47: Data manipulation
    2. Cmd 48: Visualize drop rate curve. Provide comment on the graph as per visuals.
    3. Cmd 49: Provide overall comment on org/brand retention.
  15. Cmd 50-54: Customer snapshot
    1. Cmd 50: Create recency buckets(active, lapsed, and lost ) based on the lapsation period as defined in cmd 45, Provide periods for recency buckets.
    2. Cmd 51-52: Data manipulation(no change required).
    3. Cmd 53-54: Visualization of the distribution of customers across recency and frequency buckets (make changes in cmd 54 as per absolute values).
  16. Cmd 55-60: Data manipulation (No Change required).
  17. Cmd 61-67:  Month-over-month(MoM) visualization of all major KPIs.
  18. Cmd 68-80: Importing, manipulation, and visualization of campaign data.
  19. Cmd 81-87: Manipulation and visualization of store-level data(no change required).
  20. Cmd-88-92: Product data manipulation and visualization.

    As all org have different hierarchies it is not possible to generalize code for product data manipulation and visualization. Still, basic code is provided as a blueprint using a single product category.

  21. Cmd 93: Conclusion of EDA, Based on all the findings above concludes the analysis with help of markup language comments.