Generating business embeddings using the Yelp Academic Dataset

acocos/business_embeddings

Yelp Dataset Challenge: Business Embeddings

This repo contains code used to generate business embeddings for the Yelp Academic Dataset as detailed in this blog.

Contents

| File/directory | Description |
| --- | --- |
| src/pipeline.sh | Demonstrates how to extract business/context pairs from the Yelp data and use them to train embeddings with word2vecf. Run it after downloading the Yelp data and installing word2vecf (see Getting started below), or skip training and download the resulting vectors here. |
| src/extract_contexts.py | Generates business/context pairs from the Yelp data. |
| src/infer.py | Script from the original word2vecf code, useful for loading and manipulating the resulting vectors. |
| src/examine_places.py | Script used to produce the results given in the blog post. |
| data/ | Download the Yelp data and extract it to this directory. |
| data/processed | If you run pipeline.sh, the resulting vectors end up here. You can also download them and place them here yourself. |
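
Once vectors are available in data/processed, a common first step is to look up the nearest neighbors of a business. The following is a minimal sketch, not the repo's src/infer.py. It assumes the vectors are stored in the word2vec-style text format (an optional `count dim` header line, then one `key v1 v2 ...` line per item) with at least two dimensions. The function names and the toy keys `biz_a`, `biz_b`, `biz_c` are hypothetical.

```python
import io
import math


def load_text_vectors(lines):
    """Parse word2vec-style text vectors into a dict of key -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) <= 2:
            # Header line ("<count> <dim>") or blank line: skip it.
            # Assumes real vectors have at least two dimensions.
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, key, topn=5):
    """Return the topn (key, similarity) pairs closest to the given key."""
    query = vectors[key]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != key
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy data standing in for a real vectors file (hypothetical business keys).
toy = io.StringIO("3 2\nbiz_a 1.0 0.0\nbiz_b 0.9 0.1\nbiz_c 0.0 1.0\n")
vectors = load_text_vectors(toy)
neighbors = most_similar(vectors, "biz_a", topn=2)
print(neighbors)
```

To try this on real output, pass an open file handle for the vectors file instead of the `io.StringIO` object. The linear scan is fine for exploration; a larger vocabulary would benefit from a vectorized implementation.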

Getting started

  1. The code in this repo depends on word2vecf, an adaptation of the popular word2vec software that allows vectors to be trained on arbitrary contexts. It was developed by researchers at Bar-Ilan University and is available here. You only need to download and install it if you want to run the pipeline and train the vectors yourself.
  2. If you want to train your own vectors, you'll also need to download the Yelp data and extract it to the ./data directory.
  3. Once both are in place, run src/pipeline.sh to generate your own business embeddings.
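
To make the idea of business/context pairs concrete, here is a rough sketch of one way such pairs could be produced. It is not the repo's src/extract_contexts.py, and the choice of context is an assumption: each lowercased token of a review is treated as a context for the reviewed business. It also assumes the reviews are stored as one JSON object per line with `business_id` and `text` fields, and that the trainer reads one space-separated `item context` pair per line. The function name and sample records are hypothetical.

```python
import io
import json
import re

# Simple tokenizer: runs of lowercase letters, digits, and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def business_context_pairs(review_lines):
    """Yield (business_id, token) pairs from JSON-lines review records.

    Assumption: review words serve as contexts. The actual pipeline may
    define contexts differently.
    """
    for line in review_lines:
        line = line.strip()
        if not line:
            continue
        review = json.loads(line)
        for token in TOKEN.findall(review["text"].lower()):
            yield review["business_id"], token


# Two made-up records standing in for the real review file.
sample = io.StringIO(
    '{"business_id": "b1", "text": "Great tacos!"}\n'
    '{"business_id": "b2", "text": "Slow service"}\n'
)
pairs = list(business_context_pairs(sample))

# One "item context" pair per line.
output = "\n".join(f"{biz} {ctx}" for biz, ctx in pairs)
print(output)
```

Because the function is a generator, it can stream over a large review file without loading it into memory; replace the `io.StringIO` sample with an open file handle to do so.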
