The following are the topics covered in this article.

Getting Started with Customer Data Platform (CDP - Databricks)

If you do not have access to the Databricks platform, you can request access through the Google Form.

Cluster-level Databricks URL

To access Databricks, open your cluster's URL and log in with your Capillary account credentials (single sign-on).

Common Use Cases and Resources

Important things to consider in CDP

  1. User_id is denoted by dim_event_user_id in all applicable tables except the users table, where it is denoted by user_id itself.

  2. Till_id is denoted by dim_event_zone_till_id in all applicable tables except the event_zone_tills table, where it is till_id itself.

  3. The date is denoted by dim_event_date_id in all applicable tables except the date table, where the foreign key to dim_event_date_id is date_id. date is a column in the date table in the format YYYY-MM-DD; dim_event_date_id itself is formatted as YYYYMMDD.

  4. item_id of inventory_items is to be joined with dim_item_id in bill_lineitems.

  5. The bill_summary table already excludes outlier bills, fraud customers, and admin stores, and is at a bill level. Use this table for all transactional KPIs that are non-product in nature.

  6. Use daily_till_summary for all points-related queries; it is already filtered on fraud customers, tracker points, and admin stores, and is much more convenient to use. A minimal join sketch illustrating these conventions follows this list.
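
Putting these conventions together, here is a minimal sketch (not from a live report; read_api_29 is the example org schema used later in this article) that counts bills per day using the standard join keys:

select
x.date as day,  -- date table column, format YYYY-MM-DD
count(bill_id) as total_bills
from read_api_29.bill_summary a
join read_api_29.date x on x.date_id = a.dim_event_date_id  -- convention 3
join read_api_29.users f on f.user_id = a.dim_event_user_id  -- convention 1
join read_api_29.zone_tills s on s.till_id = a.dim_event_zone_till_id  -- convention 2
where a.dim_event_date_id >= date_format('2018-01-01','yyyyMMdd')  -- dim_event_date_id is YYYYMMDD
and a.dim_event_date_id <= date_format('2018-01-31','yyyyMMdd')
group by 1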

Frequently used tables

  1. Bill_lineitems

  2. Bill_summary

  3. Event_zone_tills

  4. Users

  5. Customer_summary

  6. Points

  7. Contact_info

  8. Response_info

  9. Campaigns

  10. Campaign_groups

  11. Coupons

  12. Inventory_items

  13. daily_till_summary


Convert Zeppelin to Databricks Notebook

To convert a Zeppelin notebook to a Databricks notebook, follow these steps.

  1. Move your notebook outside the folder. In the following example, Daily is a folder.

  2. Open the link and log in with the username and password provided.

  3. Click on any of the bash consoles.

  4. Upload your Zeppelin notebook.

  5. Click + Open another file to upload your file.

  6. After importing, you should be able to see the files.

  7. Go to the console and run the following command. For example, if your file name is jb_notebook.json (Zeppelin notebook), run:

python jupyter-zeppelin.py jb_notebook.json

You can now import the converted file jb_notebook.ipynb into Databricks.




Schedule Job in Databricks

To schedule a job, follow these steps.

1. Log in to the notebook and click + New to create a job.

2. In Create Schedule, set the scheduler details.

The job is created. However, to run it, you need to perform the next steps.

3. Click on the job name.

4. Click on the Edit option available for the cluster.

5. In Cluster Type, choose the Existing All-Purpose cluster.


6. Choose the cluster from the drop-down; you will see only one option. Click on it and confirm. The job is now scheduled.


Sample Codes to Kickstart

This article provides sample code that covers common ad-hoc requests on all the popular fact tables. Use this as your reference guide until you get the hang of things.

The following are the types of sample code:


1. Important notations to follow

  1. User_id is denoted by dim_event_user_id in all applicable tables except the users table, where it is denoted by user_id itself.

  2. Till_id is denoted by dim_event_zone_till_id in all applicable tables except the event_zone_tills table, where it is till_id itself.

  3. The date is denoted by dim_event_date_id in all applicable tables except the date table, where the foreign key to dim_event_date_id is date_id. date is a column in the date table in the format YYYY-MM-DD; dim_event_date_id itself is formatted as YYYYMMDD.

  4. item_id of inventory_items is to be joined with dim_item_id in bill_lineitems.

  5. The bill_summary table already excludes outlier bills, fraud customers, and admin stores, and is at a bill level. Use this table for all transactional KPIs that are non-product in nature.

  6. Use daily_till_summary for all points-related queries; it is already filtered on fraud customers, tracker points, and admin stores, and is much more convenient to use.

2. List of tables that are needed most of the time

  1. Bill_lineitems - Line item-level data; also includes all bill-level information as well as non-member ("not interested") data.

  2. Bill_summary - Bill-level table that should be used for all queries except product-level ones, for which you should use bill_lineitems. This table also removes fraud, outlier, and admin stores, so no standard filter is necessary.

  3. Event_zone_tills - Store-equivalent table.

  4. Users - Contains all custom fields and dimensions related to customers.

  5. Customer_summary - Contains summarised results of customers based on their lifetime data; each row is created at the time of registration.

  6. Points - All points-related data is present here.

  7. Contact_info - Contact list of customers in a campaign.

  8. Response_info - Transactions that got attributed to contacted customers.

  9. Campaigns - Every row is about a campaign.

  10. Campaign_groups - Every row is about a campaign group.

  11. Coupons - All coupon-related data is present here.


3. Transaction

Tables of interest

  1. bill_lineitems (come to this only if you want to do product-level analysis)

  2. bill_summary (use this for all your transaction-level analysis)


Month on Month total sales (longer method, for educational purposes)

select
substr(date,1,7) as month, count(bill_id) as total_bills, sum(bill_amount) as total_sales
from
(
select
bill_id, bill_amount, date
from
read_api_29.bill_lineitems a
join read_api_29.outlier_status b
on a.dim_bill_outlier_status_id=b.id
join read_api_29.outlier_status c
on c.id=a.dim_line_item_outlier_status_id
join read_api_29.zone_tills s
on s.till_id=a.dim_event_zone_till_id
join read_api_29.date x
on x.date_id=a.dim_event_date_id
join read_api_29.users f
on f.user_id=a.dim_event_user_id
where
b.status in ('NORMAL','NOT-APPLICABLE','NOT-CAPTURED')  -- bill outlier status
and c.status in ('NORMAL','NOT-APPLICABLE','NOT-CAPTURED')  -- bill line item outlier status
and s.type IN ('GENERAL','NOT-APPLICABLE','INVALID','NOT-CAPTURED')  -- store admin type
and f.fraud_status in ('MARKED','not-interested','NOT-CAPTURED','INVALID','NOT_FRAUD')  -- customer fraud status
and x.date>='2018-01-01' and x.date<='2018-09-30' and a.year='2018'  -- transaction date range
group by 1,2,3
)x
group by 1



Month on Month member sales (Loyalty Sales)

select
substr(date,1,7) as month, count(bill_id) as total_bills, sum(bill_amount) as total_sales
from
(
select
bill_id, bill_amount, date
from
read_api_29.bill_lineitems a
join read_api_29.outlier_status b
on a.dim_bill_outlier_status_id=b.id
join read_api_29.outlier_status c
on c.id=a.dim_line_item_outlier_status_id
join read_api_29.zone_tills s
on s.till_id=a.dim_event_zone_till_id
join read_api_29.date x
on x.date_id=a.dim_event_date_id
join read_api_29.membership_type m
on m.id=a.dim_membership_id
join read_api_29.users f
on f.user_id=a.dim_event_user_id
where
b.status in ('NORMAL','NOT-APPLICABLE','NOT-CAPTURED')
and c.status in ('NORMAL','NOT-APPLICABLE','NOT-CAPTURED')
and s.type IN ('GENERAL','NOT-APPLICABLE','INVALID','NOT-CAPTURED')
and f.fraud_status in ('MARKED','not-interested','NOT-CAPTURED','INVALID','NOT_FRAUD')
and x.date>='2018-01-01' and x.date<='2018-09-30' and a.year='2018'
and m.type='member'
group by 1,2,3
)x
group by 1



Using bill_summary (always use bill_summary for non-product-level analysis) (loyalty bills)


select
substr(date,1,7) as month,
count(bill_id) as total_bills,
count(case when m.type='member' then bill_id else null end) as member_bills,
sum(bill_amount) as total_sales
from
read_api_29.bill_summary a
join read_api_29.date c
on c.date_id=a.dim_event_date_id
join read_api_29.membership_type m
on m.id=a.dim_membership_id
where c.date>='2018-01-01' and c.date<='2018-09-30' and a.year='2018'
and m.type='member'  -- note: with this filter, total_bills counts member bills only; drop it to compare member vs. total bills
group by 1



4. Registrations

Table of interest

The customer_summary table has all the registration information. There, dim_event_date_id denotes the registration date and dim_event_zone_till_id the registering till ID.

%sql 
 SELECT 
 count(dim_event_user_id) as registrations
 FROM 
 read_api_150432.customer_summary AS customer_summary LEFT JOIN read_api_150432.users AS users 
 ON customer_summary.dim_event_user_id = users.user_id 
 AND NOT (users.fraud_status IN ('MARKED','not-interested','NOT-CAPTURED','INVALID','NOT_FRAUD') 
 AND users.is_merged_customer IN ('UNMERGED','not-interested','INVALID','NOT-CAPTURED')) 
 LEFT JOIN read_api_150432.zone_tills AS zone_tills 
 ON customer_summary.dim_event_zone_till_id = zone_tills.till_id 
 AND NOT (zone_tills.type IN ('GENERAL','NOT-APPLICABLE','INVALID','NOT-CAPTURED'))  
 WHERE customer_summary.year >= year('2018-09-25') 
 AND customer_summary.year <= year('2018-10-25') 
 AND customer_summary.dim_event_date_id >= date_format('2018-09-25','yyyyMMdd') 
 AND customer_summary.dim_event_date_id <= date_format('2018-10-25','yyyyMMdd') 
 AND users.user_id IS NULL 
 AND zone_tills.till_id IS NULL 
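
For a quicker starting point, here is a simpler sketch (same example schema, without the fraud/merged-customer and admin-store exclusions used above) that counts registrations per month:

%sql
SELECT
substr(d.date,1,7) AS month,
count(cs.dim_event_user_id) AS registrations
FROM read_api_150432.customer_summary AS cs
JOIN read_api_150432.date AS d
ON d.date_id = cs.dim_event_date_id  -- dim_event_date_id is the registration date
WHERE d.date >= '2018-09-25' AND d.date <= '2018-10-25'
GROUP BY 1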


5. Points

General Points Code

SELECT
SUM(`daily_till_summary`.`points_expired`) AS `points_expired`,
SUM(`daily_till_summary`.`points_redeemed`) AS `points_redeemed`,
SUM(`daily_till_summary`.`points_issued`) AS `points_issued`
FROM `read_api_150595`.`daily_till_summary` AS `daily_till_summary`
WHERE `daily_till_summary`.`year` >= year('2019-03-04')
AND `daily_till_summary`.`year` <= year('2019-04-03')
AND `daily_till_summary`.`dim_event_date_id` >= date_format('2019-03-04','yyyyMMdd')
AND `daily_till_summary`.`dim_event_date_id` <= date_format('2019-04-03','yyyyMMdd')
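
To split the same totals by month, join daily_till_summary to the date dimension (a sketch on the same example schema):

SELECT
substr(d.date,1,7) AS month,
SUM(a.points_issued) AS points_issued,
SUM(a.points_redeemed) AS points_redeemed,
SUM(a.points_expired) AS points_expired
FROM read_api_150595.daily_till_summary AS a
JOIN read_api_150595.date AS d ON d.date_id = a.dim_event_date_id
WHERE a.year >= year('2019-03-04') AND a.year <= year('2019-04-03')
AND a.dim_event_date_id >= date_format('2019-03-04','yyyyMMdd')
AND a.dim_event_date_id <= date_format('2019-04-03','yyyyMMdd')
GROUP BY 1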


Long code with all filters

SELECT
SUM((CASE WHEN `deduction_type_deduction_type`.`type` = 'REDEEMED' AND `category_points_category`.`redeemtion_type` IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED') THEN `points`.`deducted_points` ELSE 0 END)) AS `points_redeemed`,
SUM((CASE WHEN `deduction_type_deduction_type`.`type` = 'EXPIRED' THEN `points`.`deducted_points` ELSE 0 END)) AS `points_expired`,
SUM((CASE WHEN `deduction_type_deduction_type`.`type` = 'RETURN' THEN `points`.`deducted_points` ELSE 0 END)) AS `points_return`,
SUM((CASE WHEN `points_event_type_points_event_type`.`type` = 'POINTS_AWARDED' AND `category_points_category`.`redeemtion_type` IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED') THEN `points`.`allocated_points` ELSE 0 END)) AS `points_awarded`
FROM `read_api_150971`.`points` AS `points`
INNER JOIN `read_api_150971`.`points_event_type` AS `points_event_type_points_event_type`
ON `points`.`dim_points_event_type_id` = `points_event_type_points_event_type`.`id`
INNER JOIN `read_api_150971`.`points_category` AS `category_points_category`
ON `points`.`dim_category_id` = `category_points_category`.`category_id`
INNER JOIN `read_api_150971`.`deduction_type` AS `deduction_type_deduction_type`
ON `points`.`dim_deduction_type_id` = `deduction_type_deduction_type`.`id`
LEFT JOIN `read_api_150971`.`points_category` AS `category_points_category_sf`
ON `points`.`dim_category_id` = `category_points_category_sf`.`category_id` AND NOT (`category_points_category_sf`.`redeemtion_type` IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED'))
LEFT JOIN `read_api_150971`.`users` AS `event_user_users_sf`
ON `points`.`dim_event_user_id` = `event_user_users_sf`.`user_id` AND NOT (`event_user_users_sf`.`fraud_status` IN ('MARKED','not-interested','NOT-CAPTURED','INVALID','NOT_FRAUD'))
LEFT JOIN `read_api_150971`.`zone_tills` AS `event_zone_till_zone_tills_sf`
ON `points`.`dim_event_zone_till_id` = `event_zone_till_zone_tills_sf`.`till_id` AND NOT (`event_zone_till_zone_tills_sf`.`type` IN ('GENERAL','NOT-APPLICABLE','INVALID','NOT-CAPTURED'))
WHERE `points`.`year` >= year('2019-07-16') AND `points`.`year` <= year('2019-08-15')
AND `points`.`dim_event_date_id` >= date_format('2019-07-16','yyyyMMdd')
AND `points`.`dim_event_date_id` <= date_format('2019-08-15','yyyyMMdd')
AND `event_user_users_sf`.`user_id` IS NULL
AND `event_zone_till_zone_tills_sf`.`till_id` IS NULL
AND `category_points_category_sf`.`category_id` IS NULL


Points about to expire


SELECT
dim_awarded_expiry_date_id,
SUM((CASE WHEN `points_event_type_points_event_type`.`type` = 'POINTS_AWARDED' AND `category_points_category`.`redeemtion_type` IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED') THEN `points`.`allocated_points` ELSE 0 END)) AS `points issued`,
SUM((CASE WHEN `deduction_type_deduction_type`.`type` = 'REDEEMED' AND `category_points_category`.`redeemtion_type` IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED') THEN `points`.`deducted_points` ELSE 0 END)) AS `points redeemed`,
SUM((CASE WHEN `points_event_type_points_event_type`.`type` = 'POINTS_AWARDED' AND `category_points_category`.`redeemtion_type` IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED') THEN `points`.`allocated_points` ELSE 0 END)) - SUM((CASE WHEN `deduction_type_deduction_type`.`type` = 'REDEEMED' AND `category_points_category`.`redeemtion_type` IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED') THEN `points`.`deducted_points` ELSE 0 END)) AS points_about_to_expire
FROM `read_api_150595`.`points` AS `points`
INNER JOIN `read_api_150595`.`deduction_type` AS `deduction_type_deduction_type`
ON `points`.`dim_deduction_type_id` = `deduction_type_deduction_type`.`id`
INNER JOIN `read_api_150595`.`points_category` AS `category_points_category`
ON `points`.`dim_category_id` = `category_points_category`.`category_id`
INNER JOIN `read_api_150595`.`points_event_type` AS `points_event_type_points_event_type`
ON `points`.`dim_points_event_type_id` = `points_event_type_points_event_type`.`id`
LEFT JOIN `read_api_150595`.`points_category` AS `category_points_category_sf`
ON `points`.`dim_category_id` = `category_points_category_sf`.`category_id` AND NOT (`category_points_category_sf`.`redeemtion_type` IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED'))
LEFT JOIN `read_api_150595`.`users` AS `event_user_users_sf`
ON `points`.`dim_event_user_id` = `event_user_users_sf`.`user_id` AND NOT (`event_user_users_sf`.`fraud_status` IN ('MARKED','not-interested','NOT-CAPTURED','INVALID','NOT_FRAUD'))
LEFT JOIN `read_api_150595`.`zone_tills` AS `event_zone_till_zone_tills_sf`
ON `points`.`dim_event_zone_till_id` = `event_zone_till_zone_tills_sf`.`till_id` AND NOT (`event_zone_till_zone_tills_sf`.`type` IN ('GENERAL','NOT-APPLICABLE','INVALID','NOT-CAPTURED'))
WHERE `points`.`year` >= year('2019-03-04') AND `points`.`year` <= year('2019-04-03')
AND `points`.`dim_awarded_expiry_date_id` >= date_format('2019-03-04','yyyyMMdd')
AND `points`.`dim_awarded_expiry_date_id` <= date_format('2019-04-03','yyyyMMdd')
AND `event_user_users_sf`.`user_id` IS NULL
AND `category_points_category_sf`.`category_id` IS NULL
AND `event_zone_till_zone_tills_sf`.`till_id` IS NULL
group by 1


Troubleshooting for points reconciliation


%sql
SELECT
points.dim_event_user_id, max(loyalty_points) as loyalty_points,
SUM((CASE WHEN deduction_type_deduction_type.type = 'REDEEMED' AND category_points_category.redeemtion_type IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED') THEN points.deducted_points ELSE 0 END)) AS points_redeemed,
SUM((CASE WHEN deduction_type_deduction_type.type = 'EXPIRED' THEN points.deducted_points ELSE 0 END)) AS points_expired,
SUM((CASE WHEN deduction_type_deduction_type.type = 'RETURN' THEN points.deducted_points ELSE 0 END)) AS points_return,
SUM((CASE WHEN points_event_type_points_event_type.type = 'POINTS_AWARDED' AND category_points_category.redeemtion_type IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED') THEN points.allocated_points ELSE 0 END)) AS points_awarded
FROM read_api_313.points AS points
INNER JOIN read_api_313.points_event_type AS points_event_type_points_event_type
ON points.dim_points_event_type_id = points_event_type_points_event_type.id
INNER JOIN read_api_313.points_category AS category_points_category
ON points.dim_category_id = category_points_category.category_id
INNER JOIN read_api_313.deduction_type AS deduction_type_deduction_type
ON points.dim_deduction_type_id = deduction_type_deduction_type.id
LEFT JOIN read_api_313.points_category AS category_points_category_sf
ON points.dim_category_id = category_points_category_sf.category_id AND NOT (category_points_category_sf.redeemtion_type IN ('REDEEMABLE','NOT-APPLICABLE','INVALID','NOT-CAPTURED'))
LEFT JOIN read_api_313.users AS event_user_users_sf
ON points.dim_event_user_id = event_user_users_sf.user_id AND NOT (event_user_users_sf.fraud_status IN ('MARKED','not-interested','NOT-CAPTURED','INVALID','NOT_FRAUD'))
LEFT JOIN read_api_313.zone_tills AS event_zone_till_zone_tills_sf
ON points.dim_event_zone_till_id = event_zone_till_zone_tills_sf.till_id AND NOT (event_zone_till_zone_tills_sf.type IN ('GENERAL','NOT-APPLICABLE','INVALID','NOT-CAPTURED'))
JOIN read_api_313.customer_summary xx ON xx.dim_event_user_id = points.dim_event_user_id
WHERE points.year >= year('2019-01-01') AND points.year <= year('2019-08-15')
AND xx.dim_event_date_id >= date_format('2019-07-16','yyyyMMdd')
AND xx.dim_event_date_id <= date_format('2019-08-15','yyyyMMdd')
AND event_user_users_sf.user_id IS NULL
AND event_zone_till_zone_tills_sf.till_id IS NULL
AND category_points_category_sf.category_id IS NULL
group by 1


Promised points

Non-redeemable points with future points allocation dates

SELECT
SUM(`points`.`allocated_points`) AS `points_awarded`
FROM `read_api_100441`.`points` AS `points`
INNER JOIN `read_api_100441`.`points_event_type` AS `points_event_type_points_event_type`
ON `points`.`dim_points_event_type_id` = `points_event_type_points_event_type`.`id`
INNER JOIN `read_api_100441`.`points_category` AS `category_points_category`
ON `points`.`dim_category_id` = `category_points_category`.`category_id`
WHERE `points_event_type_points_event_type`.`type`='POINTS_AWARDED'
AND `category_points_category`.`redeemtion_type`='NON-REDEEMABLE'
AND `points`.`dim_awarded_date_id` > date_format(now(),'yyyyMMdd')


6. Campaign Contacted and Responded

Contacted Code

SELECT
`campaign_group_campaign_group`.`group_type` AS `campaign_group.group_type`,
`campaign_campaigns`.`name` AS `campaign.name`,
`campaign_group_campaign_group`.`name` AS `campaign_group.name`,
COUNT(DISTINCT `contact_info_cluster_dim_event_user_id`.`dim_event_user_id`) AS `customers_contacted`
FROM `read_api_150321`.`contact_info` AS `contact_info_cluster_dim_event_user_id`
INNER JOIN `read_api_150321`.`campaigns` AS `campaign_campaigns`
ON `contact_info_cluster_dim_event_user_id`.`dim_campaign_id` = `campaign_campaigns`.`campaign_id`
INNER JOIN `read_api_150321`.`campaign_group` AS `campaign_group_campaign_group`
ON `contact_info_cluster_dim_event_user_id`.`dim_campaign_group_id` = `campaign_group_campaign_group`.`id`
LEFT JOIN `read_api_150321`.`campaign_delivery_status` AS `campaign_delivery_status_campaign_delivery_status_sf`
ON `contact_info_cluster_dim_event_user_id`.`dim_campaign_delivery_status_id` = `campaign_delivery_status_campaign_delivery_status_sf`.`status_id`
AND NOT (`campaign_delivery_status_campaign_delivery_status_sf`.`campaign_legend_lebel` IN ('Bounced','Clicked','Delivered','Marked Spam','Not Delivered','Opened','Sent','INVALID','NOT-CAPTURED','NOT-APPLICABLE'))
WHERE `contact_info_cluster_dim_event_user_id`.`year` >= year('2018-10-28')
AND `contact_info_cluster_dim_event_user_id`.`year` <= year('2018-11-27')
AND `contact_info_cluster_dim_event_user_id`.`dim_event_date_id` >= date_format('2018-10-28','yyyyMMdd')
AND `contact_info_cluster_dim_event_user_id`.`dim_event_date_id` <= date_format('2018-11-27','yyyyMMdd')
AND `campaign_delivery_status_campaign_delivery_status_sf`.`status_id` IS NULL
group by 1,2,3


Responded Code

SELECT
`campaign_group_campaign_group`.`group_type` AS `campaign_group.group_type`,
`campaign_campaigns`.`name` AS `campaign.name`,
`campaign_group_campaign_group`.`name` AS `campaign_group.name`,
COUNT(DISTINCT `response_bill_summary_cluster_dim_event_user_id`.`dim_event_user_id`) AS `customer_responded`
FROM `read_api_150321`.`response_info` AS `response_bill_summary_cluster_dim_event_user_id`
INNER JOIN `read_api_150321`.`campaigns` AS `campaign_campaigns`
ON `response_bill_summary_cluster_dim_event_user_id`.`dim_campaign_id` = `campaign_campaigns`.`campaign_id`
INNER JOIN `read_api_150321`.`campaign_group` AS `campaign_group_campaign_group`
ON `response_bill_summary_cluster_dim_event_user_id`.`dim_campaign_group_id` = `campaign_group_campaign_group`.`id`
WHERE `response_bill_summary_cluster_dim_event_user_id`.`year` >= year('2018-10-28')
AND `response_bill_summary_cluster_dim_event_user_id`.`year` <= year('2018-11-27')
AND `response_bill_summary_cluster_dim_event_user_id`.`dim_event_date_id` >= date_format('2018-10-28','yyyyMMdd')
AND `response_bill_summary_cluster_dim_event_user_id`.`dim_event_date_id` <= date_format('2018-11-27','yyyyMMdd')
GROUP BY 1,2,3
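
To turn the two results above into a response rate per campaign, you can join the contacted and responded aggregates on the campaign grain. A simplified sketch follows (it drops the delivery-status exclusion from the contacted query for brevity; add it back for exact parity with the code above):

SELECT
c.campaign_name,
c.customers_contacted,
coalesce(r.customers_responded, 0) AS customers_responded,
round(coalesce(r.customers_responded, 0) * 100.0 / c.customers_contacted, 2) AS response_rate_pct
FROM
(SELECT ca.name AS campaign_name, COUNT(DISTINCT ci.dim_event_user_id) AS customers_contacted
 FROM read_api_150321.contact_info ci
 JOIN read_api_150321.campaigns ca ON ci.dim_campaign_id = ca.campaign_id
 WHERE ci.year >= year('2018-10-28') AND ci.year <= year('2018-11-27')
 AND ci.dim_event_date_id >= date_format('2018-10-28','yyyyMMdd')
 AND ci.dim_event_date_id <= date_format('2018-11-27','yyyyMMdd')
 GROUP BY 1) c
LEFT JOIN
(SELECT ca.name AS campaign_name, COUNT(DISTINCT ri.dim_event_user_id) AS customers_responded
 FROM read_api_150321.response_info ri
 JOIN read_api_150321.campaigns ca ON ri.dim_campaign_id = ca.campaign_id
 WHERE ri.year >= year('2018-10-28') AND ri.year <= year('2018-11-27')
 AND ri.dim_event_date_id >= date_format('2018-10-28','yyyyMMdd')
 AND ri.dim_event_date_id <= date_format('2018-11-27','yyyyMMdd')
 GROUP BY 1) r
ON r.campaign_name = c.campaign_name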


7. Timeline ROI response code

This code computes responses made within 15 days of contact. You can change the window as per your requirement.

%sql
select
zt.store_name,
oc.campaign_name,
m.name as Milestone,
ct.type as Communication_Type,
count(distinct dim_event_user_id) as Responders,
count(distinct bill_id) as Responder_Bills,
round(sum(bill_amount),0) as responder_sales
from
(
select
b.bill_id,
a.dim_event_user_id,
maf.date as contact_date,
bs.date as response_date,
dim_milestone_id,
dim_org_config_id,
dim_phase_id,
dim_timeline_id,
b.bill_amount,
b.dim_event_zone_till_id,
a.dim_communication_type_id,
datediff(bs.date,maf.date) as difference,
rank() over (partition by b.bill_id,dim_org_config_id order by datediff(bs.date,maf.date) asc) as diff_rank
from
read_api_977.milestone_activity_fact a
join read_api_977.bill_summary b
on b.dim_event_user_id=a.dim_event_user_id
join read_api_977.date maf on maf.date_id=a.dim_event_date_id
join read_api_977.date bs on bs.date_id=b.dim_event_date_id
where
bs.date>=maf.date and b.dim_event_time_id>a.dim_event_time_id and datediff(bs.date,maf.date)>=0 and datediff(bs.date,maf.date)<=15
and maf.date>='2019-01-01' and maf.date<='2019-03-31'
and a.year>=year('2019-01-01') and a.year<=year('2019-03-31')
and b.year>=year('2019-01-01') and b.year<=year('2019-03-31')
group by 1,2,3,4,5,6,7,8,9,10,11
)x
join read_api_977.milestone m on x.dim_milestone_id=m.id
join read_api_977.org_config oc on x.dim_org_config_id=oc.id
join read_api_977.zone_tills zt on zt.till_id=x.dim_event_zone_till_id
join read_api_977.communication_type ct on ct.id=x.dim_communication_type_id
where diff_rank=1
group by 1,2,3,4


Contacts made

select
dim_milestone_id,
dim_org_config_id,
dim_phase_id,
dim_timeline_id,
count(distinct a.dim_event_user_id) as Contact_made
from
read_api_977.milestone_activity_fact a
join read_api_977.date maf on maf.date_id=a.dim_event_date_id
where
maf.date>='2019-01-01' and maf.date<='2019-03-31'
and a.year>=year('2019-01-01') and a.year<=year('2019-03-31')
group by 1,2,3,4


8. Point-in-time slab-wise reporting


select slab_name, <whatever you want to compute>  -- skeleton; a filled-in example follows
FROM
read_api_1040.bill_summary AS bill_summary
INNER JOIN read_api_1040.date AS event_date_date
ON bill_summary.dim_event_date_id = event_date_date.date_id
INNER JOIN read_api_1040.time AS event_time_time
ON bill_summary.dim_event_time_id = event_time_time.time_id
INNER JOIN read_api_1040.users_slab_name_history AS slab_history
ON bill_summary.dim_event_user_id = slab_history.user_id
AND CONCAT(event_date_date.date,' ',event_time_time.time) >= slab_history.event_date_time
AND CONCAT(event_date_date.date,' ',event_time_time.time) < slab_history.expiry_date_id
group by 1
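
For example, here is a filled-in version of the skeleton that computes bills and sales per slab (assuming the same column names as above):

select
slab_history.slab_name,
count(distinct bill_summary.bill_id) as total_bills,
sum(bill_summary.bill_amount) as total_sales
FROM read_api_1040.bill_summary AS bill_summary
INNER JOIN read_api_1040.date AS event_date_date
ON bill_summary.dim_event_date_id = event_date_date.date_id
INNER JOIN read_api_1040.time AS event_time_time
ON bill_summary.dim_event_time_id = event_time_time.time_id
INNER JOIN read_api_1040.users_slab_name_history AS slab_history
ON bill_summary.dim_event_user_id = slab_history.user_id
AND CONCAT(event_date_date.date,' ',event_time_time.time) >= slab_history.event_date_time  -- slab active at bill time
AND CONCAT(event_date_date.date,' ',event_time_time.time) < slab_history.expiry_date_id
group by 1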


9. Payment mode queries

select
a.bill_id, sum(a.payment_mode_amount) as abcd, first(b.bill_amount) as pqrs, abs(sum(a.payment_mode_amount)-first(b.bill_amount))/first(b.bill_amount) as percentage_diff
from
read_api_150330.payment_details_fact a
join read_api_150330.bill_summary b
on b.bill_id=a.bill_id
join read_api_150330.payment_mode c
on c.id=a.dim_payment_mode_id
join read_api_150330.date d
on d.date_id=b.dim_event_date_id
where d.date>='2018-01-01' and d.date<='2018-12-31' and a.year=b.year and b.year='2018'
group by 1
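
The payment_mode dimension (alias c) is joined above but not selected. To split amounts by payment mode, group on its descriptive column; the column name mode below is an assumption, so verify it against your schema:

select
c.mode as payment_mode,  -- assumed column name; check your payment_mode table
count(distinct a.bill_id) as bills,
sum(a.payment_mode_amount) as amount
from read_api_150330.payment_details_fact a
join read_api_150330.payment_mode c on c.id=a.dim_payment_mode_id
join read_api_150330.bill_summary b on b.bill_id=a.bill_id
join read_api_150330.date d on d.date_id=b.dim_event_date_id
where d.date>='2018-01-01' and d.date<='2018-12-31' and a.year=b.year and b.year='2018'
group by 1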


10. Bill to points redemption tagging

Points to bill number tagging

select
points.dim_event_user_id,
points.date, points.store_name,
points.deducted_points,
points.deduction_time,
bills.bill_time,
bills.bill_id,
bills.bill_number,
bills.bill_amount,
bills.bill_time-points.deduction_time as diff,
rank() over (partition by points.dim_event_user_id, points.date, points.store_name order by bills.bill_time-points.deduction_time asc) as diff_rank
from
(
select
points.dim_event_user_id, d.date, store_name, time_id as deduction_time, sum(deducted_points) as deducted_points
FROM read_api_100202.points AS points
INNER JOIN read_api_100202.deduction_type AS deduction_type_deduction_type
ON points.dim_deduction_type_id = deduction_type_deduction_type.id
INNER JOIN read_api_100202.points_event_type AS points_event_type_points_event_type
ON points.dim_points_event_type_id = points_event_type_points_event_type.id
INNER JOIN read_api_100202.time t
on t.time_id=points.dim_event_time_id
INNER JOIN read_api_100202.date d
on d.date_id=points.dim_event_date_id
INNER JOIN read_api_100202.zone_tills zt
on zt.till_id=points.dim_event_zone_till_id
where
points_event_type_points_event_type.type = 'POINTS_DEDUCTED' and dim_deduction_type_id=4
and d.date>='2019-01-01' and d.date<='2019-01-31'
and points.year='2019'
group by 1,2,3,4
) points
join
(
select a.dim_event_user_id, date, store_name, a.bill_id, first(time_id) as bill_time, first(bill_number) as bill_number, first(a.bill_amount) as bill_amount
from
read_api_100202.bill_lineitems a
join read_api_100202.date d on a.dim_event_date_id=d.date_id
join read_api_100202.time t on t.time_id=a.dim_event_time_id
join read_api_100202.zone_tills zt on zt.till_id=a.dim_event_zone_till_id
join read_api_100202.bill_summary bs on bs.bill_id=a.bill_id
where
d.date>='2019-01-01' and d.date<='2019-01-31'
and a.year='2019' and a.dim_event_user_id>0
group by 1,2,3,4
) bills
on bills.dim_event_user_id=points.dim_event_user_id
and bills.date=points.date
and bills.store_name=points.store_name
having diff_rank=1



Sample Code to Push Data from S3 to Databricks


You can create external tables in Databricks by placing external data in S3 buckets. Below are the bucket credentials and table-creation schema. Reach out to Satish or Jyotiska for any assistance.

EU Cluster

APAC 2 Cluster

India Cluster


keys 

S.key = pKDSUFnV9hg+xew36/01E+7ud5GwtL3/SDh/LxwB

A.key = AKIAI2P7N4J7XK2VU4RQ


Creating tables in Databricks from S3

CREATE TABLE db_name.table_name
USING CSV OPTIONS ('path' 's3a://reon.etl-data.sg/analytics/data/gsk/pufiles/report2020_03_16_22_00_04.csv','header' 'true','inferSchema' 'true')


For parquet 

CREATE TABLE db_name.table_name USING PARQUET OPTIONS ('path' 's3a://capciapac2/v2/personalizedModelTraining/150951/4176/fir_and_proxmity_model/')
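
After creating either table, a quick sanity check confirms the data is readable (a generic example using the placeholder names above):

SELECT * FROM db_name.table_name LIMIT 10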


Moving files from one FTP folder to another

from ftplib import FTP

ftp = FTP('data.capillarydata.com')
ftp.set_pasv(True)
ftp.login('username', 'password')  # replace with your FTP credentials
ftp.rename('/ecom/hanuman.csv','/emea_ph_oman/promotions/sm/hanuman.csv')
ftp.quit()



Note that renaming a file across directories is what moves it.


Add Timestamp at the End of a DataFrame Name

To create a DataFrame variable whose name is suffixed with today's date:

from datetime import date
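
# Creates a variable named like df_2024_01_01 holding the query result as a pandas DataFrame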


globals()['df_'+str(date.today()).replace('-','_')]=spark.sql("select * from read_api_100140.zone_tills limit 10").toPandas()


Embed Databricks Notebook in Insights+

Step-by-step guidelines on how to embed Databricks Notebook in Insights+.

  1. Select the notebook that you want to embed in Insights+. For example, Introduction to Capillary data schema in the following screenshot.
  2. Get the path of the notebook.
  3. Generate the URL for embedding.
    Generic embedding URL:
    https://<databricks_cluster_hostname>/api/2.0/workspace/export?path=<path_to_notebookresource>&direct_download=true&format=HTML

    <path_to_notebookresource> should be the actual path of the notebook that you want to embed.

    <databricks_cluster_hostname> should be replaced with your cluster URL. For example, for the India cluster, it is capillary-india.cloud.databricks.com.

    Here is the final URL:
    https://capillary-india.cloud.databricks.com/api/2.0/workspace/export?path=/KnowlesgeBase/1. Introduction to Capillary Data Schema&direct_download=true&format=HTML

    While copying the file URL, remove https://, so the final URL will be:

    capillary-india.cloud.databricks.com/api/2.0/workspace/export?path=/KnowlesgeBase/1. Introduction to Capillary Data Schema&direct_download=true&format=HTML

  4. Navigate to Insights+ and create a report under "external".
  5. Enter the URL and set authorization as capillary_notebook.
  6. Click Preview.
    The report will render; you can now save the report, and you are done.