CN 01: Data Preparation(Python)
Loading...

The purpose of this notebook is to prepare the dataset we will use to explore content-based filtering recommenders. This notebook should be run on a Databricks ML 7.3+ cluster.

Introduction

Content-based recommenders enable the familiar Related products-type of recommendation. These recommenders help customers identify product alternatives and better ensure that customer interactions result in sales:

The data used to build such recommendations can also be used to build more expansive recommendations, typical of the Based on Your Recent History-type recommendations, which take into consideration feature preferences aligned with user feedback:

from pyspark.sql.types import *
from pyspark.sql.functions import count,  min, max, instr, monotonically_increasing_id, pandas_udf
 
from delta.tables import *
 
import pandas as pd
 
import gzip
import shutil
 
import requests
import html

Step 1: Download & Decompress Files

The basic building block of this type of recommender is product data. These data may include information about the manufacturer, price, materials, country of origin, etc. and are typically accompanied by friendly product names and descriptions. In this series of notebooks, we'll focus on making use of the unstructured information found in product titles and descriptions as well as product category information.

The dataset we will use is the 2018 Amazon reviews dataset. It consists of several files representing both user-generated reviews and product metadata. Focusing on the 5-core subset of data in which all users and items have at least 5 reviews, we will download the gzip-compressed files to the gz folder associated with a cloud-storage mount point identified as /mnt/reviews before decompressing the metadata JSON files to a folder named metadata and reviews JSON files to a folder named reviews. Please note that the files associated with the Books category of products is being skipped as repeated requests for this file seem to be triggering download throttling from the file server:

NOTE We are providing code for the downloading of these files to your storage account as similar code is supplied by the data provider. However, you are strongly encouraged to visit the download site referenced above to review the terms and conditions for the use of this data before executing the code. Also note that the variable perform_download is set to False to prevent the unintended downloading of data from the provider. You'll need to change that variable's value to enable the download code in this notebook.

# directories for data files
download_path = '/mnt/reviews/bronze/gz'
metadata_path = '/mnt/reviews/bronze/metadata'
reviews_path = '/mnt/reviews/bronze/reviews'
 
perform_download = False # set to True if you wish to redownload the gzip files
file_urls_to_download = [
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/AMAZON_FASHION.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_AMAZON_FASHION.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_All_Beauty.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Appliances.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Appliances.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Arts_Crafts_and_Sewing.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Arts_Crafts_and_Sewing.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Automotive.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Automotive.json.gz',
  #'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Books.json.gz',  # skip these files to avoid overtaxing the data provider
  #'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Books.json.gz', # skip these files to avoid overtaxing the data provider
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/CDs_and_Vinyl.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_CDs_and_Vinyl.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Cell_Phones_and_Accessories.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Cell_Phones_and_Accessories.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Clothing_Shoes_and_Jewelry.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Clothing_Shoes_and_Jewelry.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Digital_Music.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Digital_Music.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Electronics.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Electronics.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Gift_Cards.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Gift_Cards.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Grocery_and_Gourmet_Food.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Grocery_and_Gourmet_Food.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Home_and_Kitchen.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Home_and_Kitchen.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Industrial_and_Scientific.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Industrial_and_Scientific.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Kindle_Store.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Kindle_Store.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Luxury_Beauty.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Luxury_Beauty.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Magazine_Subscriptions.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Magazine_Subscriptions.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Movies_and_TV.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Movies_and_TV.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Musical_Instruments.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Musical_Instruments.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Office_Products.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Office_Products.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Patio_Lawn_and_Garden.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Patio_Lawn_and_Garden.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Pet_Supplies.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Pet_Supplies.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Prime_Pantry.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Prime_Pantry.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Software.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Software.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Sports_and_Outdoors.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Sports_and_Outdoors.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Tools_and_Home_Improvement.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Tools_and_Home_Improvement.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Toys_and_Games.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Toys_and_Games.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Video_Games.json.gz',
  'http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Video_Games.json.gz'
  ]
if perform_download:
 
  # clean up directories from prior runs
  try:
    dbutils.fs.rm(download_path, recurse=True)
  except:
    pass
  dbutils.fs.mkdirs(download_path)
 
  try:
    dbutils.fs.rm(metadata_path, recurse=True)
  except:
    pass
  dbutils.fs.mkdirs(metadata_path)
 
  try:
    dbutils.fs.rm(reviews_path, recurse=True)
  except:
    pass
  dbutils.fs.mkdirs(reviews_path)
if perform_download:
  
  # for each file to download:
  for file_url in file_urls_to_download:
    
    print(file_url)
    
    # extract file names from the url
    gz_file_name = file_url.split('/')[-1]
    json_file_name = gz_file_name[:-3]
    
    # determine where to place unzipped json
    if 'meta_' in json_file_name:
      json_path = metadata_path
    else:
      json_path = reviews_path
 
    # download the gzipped file
    request = requests.get(file_url)
    with open('/dbfs' + download_path + '/' + gz_file_name, 'wb') as f:
      f.write(request.content)
 
    # decompress the file
    with gzip.open('/dbfs' + download_path + '/' + gz_file_name, 'rb') as f_in:
      with open('/dbfs' + json_path + '/' + json_file_name, 'wb') as f_out:
          shutil.copyfileobj(f_in, f_out)

Let's verify we now have decompressed JSON files in our metadata and reviews folders:

display(
  dbutils.fs.ls(metadata_path)
  )
 
path
name
size
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
dbfs:/mnt/reviews/bronze/metadata/meta_AMAZON_FASHION.json
meta_AMAZON_FASHION.json
1085109613
dbfs:/mnt/reviews/bronze/metadata/meta_All_Beauty.json
meta_All_Beauty.json
62219348
dbfs:/mnt/reviews/bronze/metadata/meta_Appliances.json
meta_Appliances.json
667198861
dbfs:/mnt/reviews/bronze/metadata/meta_Arts_Crafts_and_Sewing.json
meta_Arts_Crafts_and_Sewing.json
764211606
dbfs:/mnt/reviews/bronze/metadata/meta_Automotive.json
meta_Automotive.json
18203705859
dbfs:/mnt/reviews/bronze/metadata/meta_CDs_and_Vinyl.json
meta_CDs_and_Vinyl.json
641588000
dbfs:/mnt/reviews/bronze/metadata/meta_Cell_Phones_and_Accessories.json
meta_Cell_Phones_and_Accessories.json
2675494797
dbfs:/mnt/reviews/bronze/metadata/meta_Clothing_Shoes_and_Jewelry.json
meta_Clothing_Shoes_and_Jewelry.json
19319015250
dbfs:/mnt/reviews/bronze/metadata/meta_Digital_Music.json
meta_Digital_Music.json
62608763
dbfs:/mnt/reviews/bronze/metadata/meta_Electronics.json
meta_Electronics.json
10831005689
dbfs:/mnt/reviews/bronze/metadata/meta_Gift_Cards.json
meta_Gift_Cards.json
2545166
dbfs:/mnt/reviews/bronze/metadata/meta_Grocery_and_Gourmet_Food.json
meta_Grocery_and_Gourmet_Food.json
480896585
dbfs:/mnt/reviews/bronze/metadata/meta_Home_and_Kitchen.json
meta_Home_and_Kitchen.json
16709230982
dbfs:/mnt/reviews/bronze/metadata/meta_Industrial_and_Scientific.json
meta_Industrial_and_Scientific.json
620580843
dbfs:/mnt/reviews/bronze/metadata/meta_Kindle_Store.json
meta_Kindle_Store.json
484575387
dbfs:/mnt/reviews/bronze/metadata/meta_Luxury_Beauty.json
meta_Luxury_Beauty.json
29207231
dbfs:/mnt/reviews/bronze/metadata/meta_Magazine_Subscriptions.json
meta_Magazine_Subscriptions.json
5746237
dbfs:/mnt/reviews/bronze/metadata/meta_Movies_and_TV.json
meta_Movies_and_TV.json
322974517

Showing all 28 rows.

display(
  dbutils.fs.ls(reviews_path)
  )
 
path
name
size
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
dbfs:/mnt/reviews/bronze/reviews/AMAZON_FASHION.json
AMAZON_FASHION.json
356575899
dbfs:/mnt/reviews/bronze/reviews/All_Beauty.json
All_Beauty.json
171042038
dbfs:/mnt/reviews/bronze/reviews/Appliances.json
Appliances.json
263143804
dbfs:/mnt/reviews/bronze/reviews/Arts_Crafts_and_Sewing.json
Arts_Crafts_and_Sewing.json
1175211462
dbfs:/mnt/reviews/bronze/reviews/Automotive.json
Automotive.json
3350210443
dbfs:/mnt/reviews/bronze/reviews/CDs_and_Vinyl.json
CDs_and_Vinyl.json
3559158284
dbfs:/mnt/reviews/bronze/reviews/Cell_Phones_and_Accessories.json
Cell_Phones_and_Accessories.json
4639420453
dbfs:/mnt/reviews/bronze/reviews/Clothing_Shoes_and_Jewelry.json
Clothing_Shoes_and_Jewelry.json
14144939923
dbfs:/mnt/reviews/bronze/reviews/Digital_Music.json
Digital_Music.json
755164975
dbfs:/mnt/reviews/bronze/reviews/Electronics.json
Electronics.json
11508470738
dbfs:/mnt/reviews/bronze/reviews/Gift_Cards.json
Gift_Cards.json
49598201
dbfs:/mnt/reviews/bronze/reviews/Grocery_and_Gourmet_Food.json
Grocery_and_Gourmet_Food.json
2199285472
dbfs:/mnt/reviews/bronze/reviews/Home_and_Kitchen.json
Home_and_Kitchen.json
10225926270
dbfs:/mnt/reviews/bronze/reviews/Industrial_and_Scientific.json
Industrial_and_Scientific.json
776121292
dbfs:/mnt/reviews/bronze/reviews/Kindle_Store.json
Kindle_Store.json
3792361372
dbfs:/mnt/reviews/bronze/reviews/Luxury_Beauty.json
Luxury_Beauty.json
271312513
dbfs:/mnt/reviews/bronze/reviews/Magazine_Subscriptions.json
Magazine_Subscriptions.json
45334319
dbfs:/mnt/reviews/bronze/reviews/Movies_and_TV.json
Movies_and_TV.json
5176242426

Showing all 28 rows.

Step 2: Prep Metadata

With our metadata files in place, let's extract the relevant information from the documents and make the information more easily queriable. In reviewing the metadata files, it appears the brand, category, title & description fields along with each product's unique identifier, it's Amazon Standard Identification Number (asin), will be of useful. Quite a bit more information is available in the metadata files but we'll limit our attention to just these fields:

_ = spark.sql('DROP DATABASE IF EXISTS reviews CASCADE')
_ = spark.sql('CREATE DATABASE reviews')
# common elements of interest from json docs (only import ones actually used later)
metadata_common_schema = StructType([
    StructField('asin', StringType()),
    StructField('category', ArrayType(StringType())),
    StructField('description', ArrayType(StringType())),
    StructField('title', StringType())
    ])
 
# read json to dataframe
raw_metadata = (
  spark
    .read
    .json(
      metadata_path,
      schema=metadata_common_schema
      )
    )
 
display(raw_metadata)
 
asin
category
description
title
1
2
3
4
5
6
7
8
9
10
11
12
13
14
630456984X
[]
[]
Dante's Peak - Laserdisc
7106116521
[]
[]
Milliongadgets(TM) Earring Safety Backs For Fish Hook Small Earrings (150)
8037200124
[]
["Educating the next generation in the responsible re-use of bags, Envirosax Kids will help tomorrows future move in the right direction. Bag dimensions (unrolled) 17.70 x 15.75 Weight capacity 40 lbs Bag dimensions (rolled-up) 4 x 1.5 Bag weight 1.3 oz"]
Envirosax Kids Series Jessie & Lulu
8037200221
[]
["Envirosax Greengrocer - a range of plain coloured ultra-strong polyester rip-stop bags, which also have a waterproof backing. In an array of delectable colours named after the foods we love, Envirosax Greengrocer is a solution for those who like the simple things in life. Bag dimensions (unrolled) 18.70 x 16.5 Weight capacity 44 lbs Bag dimensions (rolled-up) 4 x 1.5 Bag weight 1.4 oz"]
Envirosax Greengrocer Series Bag 7 Guava
8279996567
[]
[]
Blessed by Pope Benedetto XVI Our Lady of Guadalupe Rose Scented Rosary Rosario Olor a Rosas
9239282785
[]
[]
Tideclothes ALAGIRLS Strapless Beading Homecoming Prom Dresses Short Tulle Formal Gowns White 16
9239281533
[]
[]
ALAGIRLS Strapless Beading Homecoming Prom Dresses Short Tulle Formal Gowns Grape 26Plus
9269808971
[]
[]
Syma S107C 3channel Coaxial Mini Spy Cam Helicopter (White) **MICRO SD CARD NOT INCLUDED**
9654263246
[]
["When you pull out your extra large carbon fiber money clip it gives you that status as you know whats new and has people asking what it is and where did you get it? Each carbon fiber money clip is made from the highest quality Carbon Fiber in America. Each money clip is infused with an epoxy resin giving you a strong and light weight money clip. This gives you the ability to walk through airport detectors with out having to put your money into a plastic tray. Each money clip comes with a 5 year warranty against de-lamination. Carbon fiber money clips very similar to these sell for up to 4 times what you can buy it for here. The reason for why the price is so exceptional is do the number on money clips that are manufactured at a time."]
X. L. Carbon Fiber Money Clip, made in the USA
B00004T3SN
[]
["A fun addition to any costume party, play, or Halloween event. These gorgeous cat ears have soft black \"fur\" on the outside with pink lining for the ears on the inside. Each ear also has a playful bell and bow. These fun cat ears are secured with hair clips so you do not get that uncomfortable headache after a few hours of wearing your costume."]
Shimmer Anne Shine Clip On Costume/Halloween Cat Ears
B00005OTJ8
[]
[]
SpongeBob Squarepants Comforter - Twin
B00006MVYA
[]
[]
Cedar Tie Hanger
B000072X6P
[]
[]
Perlina Morgan Collection Wallet on a String (Black )
B000072WZM
[]
[]
Trager Metropolitan

Showing the first 1000 rows.

A notebook made available by the data host indicates some entries may be invalid and should be removed. These records are identified by the presence of the getTime JavaScript method call in the title field:

# remove bad records and add ID for deduplication work
metadata = (
  raw_metadata
    .filter( instr(raw_metadata.title, 'getTime')==0 ) # unformatted title
    )
 
metadata.count()
Out[10]: 11815974

The dataset also contains a few duplicate entries based on the asin value:

# count number of records per ASIN value
(
  metadata
  .groupBy('asin')  
    .agg(count('*').alias('recs'))
  .filter('recs > 1')
  ).count()
Out[11]: 272563

Using an artificial id, we will eliminate the duplicates by arbitrarily selecting one record for each ASIN to remain in the dataset. Notice we are caching the dataframe within which this id is defined in order to fix its value. Otherwise, the value generated by monotonically_increasing_id() will be inconsistent during the self-join:

# add id to enable de-duplication and more efficient lookups (cache to fix id values)
metadata_with_dupes = (
  metadata
    .withColumn('id', monotonically_increasing_id())
  ).cache()
 
# locate first entry for each asin
first_asin = (
  metadata_with_dupes
    .groupBy('asin')
      .agg(min('id').alias('id'))
  )
 
# join to eliminate unmatched entries
deduped_metadata = (
  metadata_with_dupes
    .join(first_asin, on='id', how='leftsemi')
  )
 
deduped_metadata.count()
Out[12]: 11543411
# should return 0 if no duplicates
(
  deduped_metadata
  .groupBy('asin')
    .agg(count('*').alias('recs'))
  .filter('recs > 1')
  ).count()
Out[13]: 0

To make our next data processing steps easier to perform, we will persist the deduplicated data to storage. By using Delta Lake as the storage format, we are enabling a set of data modification statements which we will employ later:

# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS reviews.metadata')
 
# drop any old delta lake files that might have been created
shutil.rmtree('/dbfs/mnt/reviews/silver/metadata', ignore_errors=True)
 
# persist as delta table
(
  deduped_metadata
   .repartition(sc.defaultParallelism * 4)
   .write
   .format('delta')
   .mode('overwrite')
   .save('/mnt/reviews/silver/metadata')
  )
 
# make table queriable
_ = spark.sql('''
  CREATE TABLE IF NOT EXISTS reviews.metadata
  USING DELTA
  LOCATION '/mnt/reviews/silver/metadata'
  ''')
 
# show data
display(
  spark.table('reviews.metadata')
  )
 
id
asin
category
description
title
1
2
3
4
5
6
7
8
9
10
11
12
13
103079216646
B0006TZLMC
["Arts, Crafts & Sewing", "Painting, Drawing & Art Supplies", "Painting", "Paints"]
["0002386-1 Color: Transparent Yellow Iron Oxide Features: -Fluid acrylics are ideal for spraying, brushing, staining, and can be mixed with other Golden products. -Blend with airbrush medium for spray application. -Excellent for fabric application. -Category: Painting. -Age Group: Adult. Product Type: -Paint. Country of Manufacture: -United States. Dimensions: Overall Height - Top to Bottom: -3\". Overall Width - Side to Side: -1\". Overall Depth - Front to Back: -1\". Overall Product Weight: -0.1 lbs."]
Golden Fluid Acrylic Paint 1 Ounce-Transparent Yellow Iron Oxide
4552665358110
B00DQ3IXT8
["Kindle Store", "Kindle eBooks", "Health, Fitness & Dieting"]
[]
Mystical Chinese Diet - Kindle edition
3530463125852
B001B68ZHW
["Home & Kitchen", "Home Dcor", "Home Fragrance", "Incense & Incense Holders", "Incense"]
["Nippon Kodo's Morning Star Incense is one of the most popular brands of Japanese incense. Crafted from pure floral essences and rare aromatic woods, Morning Star can be enjoyed every day. Each box comes with a ceramic tile incense burner. Every item is made from the finniest materials including, 18/10 Stainless Steel, Porcelain, Hand Crafted Clay and many others. We strive to bring each consumer an item that has been tested and used by our staff and insured to be enjoyed by you and your family. Best selling incense for daily use. Morning star has been one of Nippon Kodo's best-selling products over the past 40 years."]
Morning Star Sandlewood Scent Japanese Incense Sticks, 6 Box of 50 Sticks (Sage, Sandalwood, Cedarwood, Musk, Frankincense and Myrrh)
2147483655482
B00LV9SC7C
["Clothing, Shoes & Jewelry", "Women", "Shoes", "Rubber", "Synthetic sole", "Exotic, relaxed sandal on a plush molded footbed", "Offered in snake print or flower print leather", "Suede-wrapped footbed with comfort latex cushion", "Brushed metal stud and buckle detail", "1 1/4 inches with a 3/4 inch platform"]
["Exotic, relaxed style on a plushmolded footbed. . . . Offered in snake print or flowerprint leather . Suede-wrapped footbed withcomfort latex cushion . Lightweight molded cork midsoledesigned to contour the foot . Molded lightweight outsole . Brushed metal stud and buckledetail . 1 1/4 inches with a 3/4 inchplatform </li></ul>"]
Sofft - Womens - Bettina
1357209682890
B009YSJMKS
["Cell Phones & Accessories", "Accessories", "Chargers & Power Adapters", "Car Chargers"]
["<b>2-Port USB Car Charger Vehicle Power Adapter - Silver(2000mA)</b> \n<br>Brand new Biru Gear charger. \n<br>Color:Silver. \n<br>Input: 12v DC. \n<br>Output: 5.0V / 2000mA. \n<br>Short circuit protection. \n<br>LED charging indication. \n<br>2 USB Port. \n<br>Plug never heats or overcharges the battery. \n<br>Intelligent IC chip inside recognizes a full battery and automatically switch to a saver. \n<br>You must have a USB charging cable (sold separately) in order to charge your device with this charger. \n<br> \n<br> The charger is only compatible to certain Tablet Model, please follow instruction very carefully. \n<br> USB Port label 1A able to charge : Samsung Galaxy Tab 7inch, 10.1 inch \n<br> USB Port label 2.1A able to charge : Apple iPad 1 and Apple iPad 2 \n<br> \n<b>Please Note :</b> the Car Charger \n<b> DO NOT</b> compatible to BlackBerry Cell Phone, Tablet / Acer Tablet / Asus Tablet / Toshiba Tablet \n<br> \n<br> \n<b>USB 2.0 A to Micro-USB B High Speed Retractable Cable</b> \n<br>Brand new high quality retractable generic cable. \n<br>Extend the cord with your desired length. \n<br>Connectors: USB Type A Male to Type B Micro-USB Male. \n<br>Connect phone, camera, eBook reader, and more with your PC/Laptop by USB port. \n<br>USB is supported on Windows 98/2000/ME/XP, not supported on Windows 95/98 1st Edition or Windows NT. \n<br>Transfer rates up to 480Mbps depending on USB version. \n<br>Foil and braid shielding reduces EMI/RFI interference thus ensuring high-speed, error-free data transfer. \n<br> \n<br>"]
BIRUGEAR Metallic Silver 2-Port USB Car Charger Adapter 2A + 3FT Retractable Cable for Samsung Galaxy S6/ S6 Edge/ S6 active, Galaxy Mega 2, Galaxy Note 5 4, Galaxy S5 / S4 / mini / zoom
352187320883
B000WEKQ3K
["Automotive", "Interior Accessories", "Floor Mats & Cargo Liners", "Floor Mats", "Custom Fit"]
["In the quest for the most advanced concept in floor protection, the talented designers and engineers at MacNeil Automotive have worked tirelessly to develop the most advanced DigitalFitfloor protection available today!. The WeatherTech FloorLinerDigit", "<div class=\"aplus\">", "In the quest for the most advanced concept in floor protection, the talented designers and engineers at MacNeil Automotive have worked tirelessly to develop the most advanced DigitalFit floor protection available today. WeatherTech FloorLiner Mats are designed to keep your interior carpet clean, with quality floor protection proudly designed, engineered and manufactured in the USA.", "The WeatherTech FloorLiner DigitalFit accurately and completely lines the interior carpet giving absolute interior protection. Digital laser measurements of interior surfaces offer a consistently perfect fit. A patent pending High-Density Tri-Extruded (HDTE) material allows for a rigid core for strength while offering surface friction to the carpet, as well as tactile feel to the surface. Advanced surfacing creates channels that carry fluids and debris to a lower reservoir with further channeling to help minimize fluid movement while driving. Once fluids become trapped in the reservoir, away from shoes and clothing, they are easily removed from the WeatherTech FloorLiner DigitalFit over the door sill...no muss, no fuss.", "WeatherTech FloorLiner mats are simple to install, with grommet holes that match the factory carpet in your vehicle. Note: FloorLiner is designed for vehicles with factory carpet installed when manufactured. They will not fit vehicles with vinyl floors unless otherwise specified. Please take your old floor mats out before installing your new WeatherTech FloorLiner.", "FloorLiner protects the resale value of your vehicle, too. Simply put, the cost of owning a vehicle is the difference between what you pay for it and what you sell it for. There are many factors that influence resale value; miles, exterior condition, maintenance history and of course interior condition. FloorLiner mats protect against some of the most common causes of interior damage, helping to keep your investment in good shape for years to come.", "The WeatherTech FloorLiner is guaranteed against faulty materials and workmanship for a period of three years from the date of purchase. Exclusions to this warranty are wear due to severe abrasive conditions and chemical contamination, such as spilled gasoline, bleach, etc. Depending on the situation, MacNeil Automotive will either replace your FloorLiner at no charge or charge you a prorated amount for a replacement."]
WeatherTech 440661 Custom Fit Front FloorLiners (Black)
2637109927413
B01CTA02WO
["Clothing, Shoes & Jewelry", "Women", "Shoes"]
[]
Jessica Simpson Mabelle Women US 5.5 Sandbar Flats
317827584442
B000K2LSZE
["Automotive", "Exterior Accessories", "Towing Products & Winches", "Hitch Accessories", "Hitch Covers"]
["Deer Hitch Cover<br><br> This clever motion hitch cover slids onto your ball hitch and plug into your 4-way trailer connector. Step on the brakes and it goes to work.<br><br> Deer hitch cover raises its front legs in surrender and the target on its belly lights up.<br><br> Hitch cover is 11\" tall and fits on standard 2\" ball hitches."]
Hitch Critters 3591 Don't Shoot Deer Moving Ball Hitch Cover and Brake Light
3770981295468
B009KSVPSY
["Home & Kitchen", "Kitchen & Dining", "Dining & Entertaining", "Serveware", "Serving Bowls & Tureens", "Tureens"]
["The Polish Pottery Dish is an exceptional piece of Polish Stoneware crafter`s artwork from Boleslawiec, Poland, and is ideal for serving appetizers or side dishes. Because of the high-quality clay used to make Polish Pottery from Boleslawiec, the serving dish is safe to put in the oven, stove or microwave. The glaze that covers it makes it dishwasher-safe and extremely easy to wash!"]
Polish Pottery Large Serving Platter Zaklady Ceramiczne Boleslawiec 1007-166a
1803886270199
B00A2RVDMK
["Clothing, Shoes & Jewelry", "Women", "Jewelry", "Necklaces", "Pendants"]
["you will like it , it is very beautiful!"]
Chaomingzhen Gold Plated Fox Necklace Pendant for Women Austrian Crystal Fashion Jewelry
4329327039455
B0171RZCY8
["Home & Kitchen", "Bedding", "Blankets & Throws", "Bed Blankets"]
["With personalized heat settings prepare for a restful night of soothing comfort and elegance. The thermofine warming system of the blanket monitors temperature and auto-adjusts to keep the level of warmth consistent throughout the night. Sunbeam blankets are a smart way to cut energy costs as well. They allow you to save up to 10 percent a year on your heating bills by turning on your heated blanket and setting your thermostat down by 10-15 percent for 8 hours while sleeping. Made of 100 percent polyester, the blanket presents a soft surface that is smooth and soothing to the touch. The blanket includes 10 heating levels, while an easy-to-operate control makes it simple to find exactly the right setting. The blanket is available in Twin, Full, Queen, and King sizes, with dual controls and split elements for the Queen and King sizes. For easy care, the blanket is fully machine washable and dryer safe. Other Bed Blankets 12"]
Channeled Microplush Electric Heated Blanket - King Size (Walnut)
6983616831866
B00004W5ZV
[]
[]
The Storytellas - Strong As Death
6846177878460
B01H2VC0MU
["Cell Phones & Accessories", "Accessories", "Accessory Kits"]
[]
LG G5 Flip Wallet Case Cover &amp; Screen Protector &amp; Charging Cable Bundle! A8183 Batman

Showing the first 1000 rows.

_ = metadata_with_dupes.unpersist()

With our deduplicated data in place, let's turn now to the cleansing of some of the fields. Taking a close look at the description field, we can see there are unescaped HTML characters and full HTML tags which we need to clean up. We can see some similar cleansing is needed for the title and the category fields. With the category field, it appears that the category hierarchy breaks down at a level where an HTML tag is encountered. For that field, we will truncate the category information at the point a tag is discovered.

To make this code easier to implement, we'll make use of a pandas UDF and update our data in place:

# pandas function to unescape HTML characters
@pandas_udf(StringType())
def unescape_html(text: pd.Series) -> pd.Series:
  return text.apply(html.unescape)
 
# register function for use with SQL
_ = spark.udf.register('unescape_html', unescape_html)
%sql
 
MERGE INTO reviews.metadata x
USING ( 
  SELECT -- remove HTML from feature array
    a.id,
    unescape_html(
      REGEXP_REPLACE(a.title, '<.+?>', '')
      ) as title
  FROM reviews.metadata a
  WHERE a.title RLIKE '<.+?>|&\\w+;'  -- contains html tags & chars
  ) y
ON x.id = y.id
WHEN MATCHED THEN
  UPDATE SET x.title = y.title
OK
%sql
 
MERGE INTO reviews.metadata x
USING ( 
  SELECT -- remove HTML from feature array
    a.id,
    COLLECT_LIST( 
      unescape_html(
        REGEXP_REPLACE(b.d, '<.+?>', '')
        )
      ) as description
  FROM reviews.metadata a
  LATERAL VIEW explode(a.description) b as d
  WHERE b.d RLIKE '<.+?>|&\\w+;'  -- contains html tags & chars
  GROUP BY a.id
  ) y
ON x.id = y.id
WHEN MATCHED THEN
  UPDATE SET x.description = y.description
OK
%sql
 
MERGE INTO reviews.metadata x
USING ( 
  SELECT  -- only keep elements prior to html
    m.id,
    COLLECT_LIST( unescape_html(o.c) ) as category
  FROM reviews.metadata m
  INNER JOIN (
    SELECT -- find first occurance of html in categories
      a.id,
      MIN(b.index) as first_bad_index
    FROM reviews.metadata a
    LATERAL VIEW posexplode(a.category) b as index,c
    WHERE b.c RLIKE '<.+?>'  -- contains html tags
    GROUP BY a.id
    ) n 
    ON m.id=n.id
  LATERAL VIEW posexplode(m.category) o as index,c
  WHERE o.index < n.first_bad_index
  GROUP BY m.id
  ) y
  ON x.id=y.id
WHEN MATCHED THEN
  UPDATE SET x.category=y.category
OK
spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', False)
_ = spark.sql('VACUUM reviews.metadata RETAIN 0 HOURS')

Let's now see how our cleansed metadata appears:

display(
  spark.table('reviews.metadata')
  )
 
id
asin
category
description
title
1
2
3
4
5
6
7
8
9
10
11
12
13
26
B000073249
[]
[]
Travelon Comfort Grip 3-in-1 Travel Handle
29
B000072XRF
[]
[]
La Regale Peau de Soie Chic Full Flap Handbag (Ivory)
964
B0006BHTG6
[]
["Every now and then a vest comes along that's so versatile, it becomes your favorite. The Stockman does just that... designed to be worn proudly as a single layer or as a valued layer of warmth. The shell is constructed from genuine Canadian Melton Wool, 24 oz. strong and lined with premium grade taffeta. Distinctly western with its hard-tailored Whiskey NuBuck trim featured on the front pocket welts and western back yok piping. Buttons are fine, however quick release solid brass logo belt lin for warth and overall comfort and mobility. Made in USA."]
Stockman Vest
1677
B000EFOSFO
[]
[]
Skechers Kids' Bikers Crossbar (Pink 12.0 M)
1697
B000EM16IY
[]
[]
Gem Avenue 925 Sterling Silver Cube Clear AB Crystal 8mm x 21mm Charm Swarovski Elements Pendant
1806
B000FCUNJQ
[]
["The Akademiks Solver Tee is 100% cotton with screenprinted graphics. Imported."]
Akademiks Men's Solver Tee
1950
B000G1OAX6
[]
[]
Littlearth Designer Fender Purse - Las Vegas (Las Vegas)
2040
B000GX7OW8
[]
["This stunning bracelet measures 7.2 inches in length and weighs 4.9 grams of 14K White Gold. This fashionable piece features a large lobster claw clasp for added security and comfort. Don't leave home without it! Netaya Style# 2CWBR50L/07.2"]
14K Rolo Link Bracelet 7.2"
2214
B000HX58J8
[]
["Gemstone Genuine Peridot 0.50 carat Round Cut 5.00mm vibrant color clean clarity Pendant pure sterling silver with .925 stamp 1.24 grams Dimension: 1/2 x 3/8 inch Includes 18 inches Silver Chain modern design featuring channel prong setting includes a gift pouch . Excellent Craftsmanship. Unique one of a Kind Design. Includes Free Shipping, Free Gift Box & Free 18 inch Silver Box Chain"]
Round Cut Peridot Pendant Sterling Silver
2250
B000I07050
[]
[]
SEDUCE 3000 5" Thigh Boot, Black Stretch Patent-Size 9
2453
B000IHQ0ZE
[]
["These gorgeous sparkling diamond earrings light up your face with a brilliance never seen before. They are hand crafted in stunning feminine gold mountings that display the diamonds to their maximum beauty. To be worn on any occasion, both casual and formal, these earrings will become an everyday part of your life.14K White Gold Princess Diamond 4-Prong Stud earrings (3/4ctw, G-H, SI2-I1) Netaya Style# 302020203075"]
14K White Gold Princess Diamond 4-Prong Stud Earrings (3/4ctw, G-H, SI2-I1)
2509
B000J28C7C
[]
[]
Smartwool Men's City Slicker Socks
2529
B000J41B2I
[]
[]
Go Gloves (Red)

Showing the first 1000 rows.

Step 3: Prep Reviews

As with our metadata files, there are only a limited number of fields in the reviews JSON documents relevant to our needs. We'll retrieve the ASIN for each product as well as the reviewer's ID and the time of their review. Fields such as whether a purchase was verified or the number of other users who found the review could be useful but we'll leave them be for now:

# common elements of interest from json docs
reviews_common_schema = StructType([
  StructField('asin', StringType()),
  StructField('overall', DoubleType()),
  StructField('reviewerID', StringType()),
  StructField('unixReviewTime', LongType())
  ])
 
# read json to dataframe
reviews = (
  spark
    .read
    .json(
      reviews_path,
      schema=reviews_common_schema
      )
    )
 
# present data for review
display(
  reviews
  )
 
asin
overall
reviewerID
unixReviewTime
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
7106116521
5
A1D4G1SNUZWQOT
1413763200
7106116521
2
A3DDWDH9PX2YX2
1411862400
7106116521
4
A2MWC41EW7XL15
1408924800
7106116521
2
A2UH2QQ275NV45
1408838400
7106116521
3
A89F3LQADZBS5
1406419200
7106116521
5
A29HLOUW0NS0EH
1405728000
7106116521
4
A7QS961ROI6E0
1401494400
B00007GDFV
3
A1BB77SEBQT8VX
1379808000
B00007GDFV
3
AHWOW7D1ABO9C
1374019200
B00007GDFV
3
AKS3GULZE0HFC
1365811200
B00007GDFV
4
A38NS6NF6WPXS
1362787200
B00007GDFV
2
A1KOKO3HTSAI1H
1359244800
B00007GDFV
1
A1G3S57JGZNPCL
1357257600
B00007GDFV
1
AGBL3TTP6GV4X
1343606400
B00007GDFV
4
A1Y36BSE9GKXLV
1268352000
B00007GDFV
3
A1L1U968VNYVA4
1509408000
B00007GDFV
5
A1NSKPSR0XZ0C9
1508803200
B00007GDFV
5
A3O5SXH5O8DWRP
1507852800

Showing the first 1000 rows.

A quick check for duplicates finds that we're have a little clean up to do. While maybe not truly duplicates, if a user submits multiple reviews on a single product, we want to take the latest of these as their go-forward review. From there, we want to make sure the system is not capturing multiple records for that same, last date and time:

# tack on sequential ID
reviews_with_duplicates = (
  reviews.withColumn('rid', monotonically_increasing_id())
  ).cache() # cache to fix the id in place
 
 
# locate last product review by reviewer
last_review_date = (
  reviews_with_duplicates
    .groupBy('asin', 'reviewerID')
      .agg(max('unixReviewTime').alias('unixReviewTime'))
  )
 
# deal with possible multiple entries on a given date
last_review = (
  reviews_with_duplicates
    .join(last_review_date, on=['asin','reviewerID','unixReviewTime'], how='leftsemi')
    .groupBy('asin','reviewerID')
      .agg(min('rid').alias('rid'))
    )
 
# locate last product review by a user
deduped_reviews = (reviews_with_duplicates
  .join(last_review, on=['asin','reviewerID','rid'], how='leftsemi')
  .drop('rid')
  )
 
display(deduped_reviews)
 
asin
reviewerID
overall
unixReviewTime
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
0000031852
A3URQ0LXLV46E9
4
1400544000
0000031887
ATU364KXBXX8P
5
1426464000
0001526146
A1W23DEQW7ZK19
5
1426377600
0005019281
A1IR7ZV3NUY7N7
5
1482883200
0005019281
A1OX08NOF3NWXR
3
1420070400
0005019281
A2U7DG83EXUSFP
5
1516838400
0005019281
A312KM8IVZCEPA
5
1387929600
0005119367
A14CRYKD47EZFX
5
1471564800
0005119367
A3FS7RSDQ7JSWC
5
1409702400
0005164885
A235TIADWGJL6Q
1
1208736000
0005164885
A2C68E6GNW4RW6
5
1264032000
0005419263
A1TN0V94A3ECSH
5
1254009600
0005452287
AX5JHLJ02KJG3
5
1429833600
0101635370
AFNLXOHWU8WXQ
1
1426550400
0140503528
A14JXAEQKIQDZG
5
1504742400
0140503528
A2Z97U5Q8APOG1
5
1436745600
0140503528
A3FFJLY0W2DG6L
5
1497052800
014789302X
A18M3W71X8TYY6
5
1506211200

Showing the first 1000 rows.

Let's now persist our data to a Delta Lake table before proceeding:

# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS reviews.reviews')
 
# drop any old delta lake files that might have been created
shutil.rmtree('/dbfs/mnt/reviews/silver/reviews', ignore_errors=True)
 
# persist reviews as delta table
(
  deduped_reviews
   .repartition(sc.defaultParallelism * 4)
   .write 
   .format('delta')
   .mode('overwrite')
   .save('/mnt/reviews/silver/reviews')
  )
 
# make table queriable
_ = spark.sql('''
  CREATE TABLE IF NOT EXISTS reviews.reviews
  USING DELTA
  LOCATION '/mnt/reviews/silver/reviews'
  ''')
_ = reviews_with_duplicates.unpersist()

To make joining to the product metadata easier, we'll add the product ID generated earlier with our metadata to our reviews table: