Getting My Feet Wet with Deep Learning and Word2Vec

After listening to this episode of Practical AI, I got interested in experimenting with Word2Vec. According to Wikipedia, Word2vec is a group of related models that are used to produce word embeddings.

I’m not sure exactly what that means, but from what I understand, it can be used to make educated guesses about related words. For example, GIPHY uses Word2Vec to create better suggestions for searches and GIFs.

One of the first times I heard about Word2Vec was around a now-famous example: king - man + woman = queen.

Turning words into vectors means they can be added and subtracted.
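To make that concrete, here is a tiny sketch in plain Python. The two-dimensional vectors are made up for illustration (real Word2Vec vectors have a hundred or more dimensions and are learned from text), and the helper names `cosine` and `analogy` are my own, not part of any library.

```python
import math

# Made-up 2-D "embeddings": first axis is roughly royalty, second roughly gender.
# These numbers are invented for illustration, not learned by Word2Vec.
toy_vectors = {
    "king": (0.9, 0.9),
    "queen": (0.9, -0.9),
    "man": (0.1, 0.9),
    "woman": (0.1, -0.9),
    "burger": (-0.8, 0.1),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors):
    """Add the positive words, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Skip the query words themselves so the answer is a new word.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors))  # queen
```

With these toy numbers, king - man + woman lands exactly on queen. Real embeddings are noisier, so the closest word is only a best guess.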

I’m still a novice at this, but we can explore Deep Learning and Word2Vec together! Let me show you what I was able to do!

The Idea: How to explore Deep Learning with a fun topic

I wanted to find related words in the Bob’s Burgers scripts. This originally started with some infatuation with their Burger of the Day list. I was actually able to find them all online, which led me down the rabbit hole of looking at all their scripts.

One thing I did know about deep learning beforehand is that getting data and cleaning it up is a good chunk of the work. It is also the part I’m most comfortable with.

I read Gensim’s Word2Vec Tutorial, which gave me a lot of help figuring out how to get started.

Step 1: Collect the Data

I found a source for the scripts of all Bob’s Burgers episodes. They were on a website in HTML so I fired up BeautifulSoup (this is all in Python, btw) and began scraping the website for scripts. Once I had them all, I saved them for further cleaning.
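Here is a rough sketch of the "HTML in, plain text out" step. To keep it runnable without extra installs, it uses the standard library's `html.parser` instead of BeautifulSoup, so it is not the code I actually ran. The sample HTML and the names `TextExtractor` and `html_to_text` are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented stand-in for a downloaded script page.
sample_html = (
    "<html><body><h1>Episode 1</h1><p>Bob: Hello.</p>"
    "<script>var x = 1;</script><p>Linda: Hi!</p></body></html>"
)
print(html_to_text(sample_html))  # Episode 1 Bob: Hello. Linda: Hi!
```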

Step 2: Cleaning the Data

Once I had all the scripts downloaded in a folder as text files, I had to clean them.

They came in one big blob of text, but I wanted to split that into lines. I chose to make each sentence a line. I then saved each script back into its original file.

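def fix_scripts():
    from os import listdir
    from os.path import isfile, join

    # saved_scripts_dir is the folder holding the downloaded scripts
    # (assumed to be defined elsewhere in the module)
    script_paths = [f for f in listdir(
        saved_scripts_dir) if isfile(join(saved_scripts_dir, f))]

    for script in script_paths:
        file_path = join(saved_scripts_dir, script)

        # Read the whole script into memory
        with open(file_path, "r") as file:
            text = file.read()

        # Put each sentence on its own line
        text = text.strip()
        text = text.replace(". ", ".\n")

        # Overwrite the original file with the cleaned text
        with open(file_path, "w") as file:
            file.write(text)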
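Replacing ". " only catches sentences that end in a period. A slightly broader split, sketched here with the standard `re` module, also breaks after question marks and exclamation points. The function name `split_sentences` and the sample line are made up, and this is still a rough heuristic: it will happily split after an abbreviation like "Mr." too.

```python
import re


def split_sentences(text):
    """Put each sentence on its own line (a rough heuristic, not a real sentence tokenizer)."""
    # Split on whitespace that directly follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(part for part in parts if part)


sample = "Oh my God! Is that a burger? It is. Nice"
print(split_sentences(sample))
# Oh my God!
# Is that a burger?
# It is.
# Nice
```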

Step 3: Preparing the Data

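Now that we have our scripts formatted the way we want, it’s time to prepare the data for use. Word2Vec expects each sentence as a list of words, so we run every line through Gensim’s simple_preprocess, which lowercases and tokenizes it.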

def preprocess_scripts():
    import gensim
    from os import listdir
    from os.path import isfile, join

    script_paths = [f for f in listdir(
        saved_scripts_dir) if isfile(join(saved_scripts_dir, f))]

    for script in script_paths:
        file_path = join(saved_scripts_dir, script)
        with open(file_path, 'rb') as f:
            for line in f:
                # Lowercase and tokenize the line, then yield it as a
                # list of words
                yield gensim.utils.simple_preprocess(line)
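To get a feel for what each yielded item looks like, here is a rough stand-in for simple_preprocess written with only the standard library. It is an approximation based on my understanding of Gensim's defaults (lowercase, letters only, tokens between 2 and 15 characters), not Gensim's actual code, and `rough_preprocess` is a made-up name.

```python
import re


def rough_preprocess(line, min_len=2, max_len=15):
    """Approximate stand-in for gensim.utils.simple_preprocess (not gensim's code)."""
    # The scripts are opened in binary mode, so lines arrive as bytes.
    if isinstance(line, bytes):
        line = line.decode("utf-8", errors="ignore")
    # Runs of word characters that are not digits, lowercased.
    tokens = re.findall(r"[^\W\d]+", line.lower())
    # Drop very short and very long tokens.
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


print(rough_preprocess(b"Bob: I'm flipping 12 burgers today!\n"))
# ['bob', 'flipping', 'burgers', 'today']
```

Notice that punctuation, numbers, and one-letter fragments like the "I" and "m" from "I'm" disappear, which is why the model's vocabulary ends up as plain lowercase words.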

Step 4: Training the Model

The next step is training the model on our data. We also save the model so we don’t need to retrain it later.

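def generate_model(documents):
    import os
    from gensim.models import word2vec

    # documents must be a list of token lists (len() is called on it below)
    model = word2vec.Word2Vec(
        documents,
        vector_size=150,  # this parameter is named `size` in gensim versions before 4.0
        window=10,
        min_count=2,
        workers=10)
    model.train(documents, total_examples=len(documents), epochs=50)

    # Save next to this file; the vectors/ folder must already exist
    abspath = os.path.dirname(os.path.abspath(__file__))
    model.save(os.path.join(abspath, "vectors/burger_scripts"))
    return model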
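One thing to watch: preprocess_scripts() is a generator, while generate_model calls len(documents) and passes over the data more than once. A generator has no length and is empty after one pass, so it needs to be turned into a list first. The sketch below shows the difference with a stand-in generator (`fake_sentences` is made up, and no Gensim is required to run it).

```python
def fake_sentences():
    """Stand-in for preprocess_scripts(): yields one token list per line."""
    yield ["bob", "flips", "burgers"]
    yield ["linda", "sings"]


gen = fake_sentences()
first_pass = sum(1 for _ in gen)
second_pass = sum(1 for _ in gen)  # the generator is already exhausted
print(first_pass, second_pass)     # 2 0

# Materialize the generator so it has a length and can be reused.
documents = list(fake_sentences())
print(len(documents))              # 2
```

In the real pipeline that would look like documents = list(preprocess_scripts()) followed by generate_model(documents).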

Step 5: Using the Model

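Now we can use the trained model to guess at similar words. Each result pairs a word with its cosine similarity to the query; higher means more closely related.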

What words relate to burger?

model.wv.most_similar(positive='burger')
Results for burger
[   ('burgers', 0.4621014893054962),
    ('chive', 0.39505648612976074),
    ('chard', 0.37232810258865356),
    ('oregano', 0.3668360710144043),
    ('choke', 0.3612610399723053),
    ('korea', 0.3503205180168152),
    ('tuna', 0.3428816497325897),
    ('bob', 0.335074782371521),
    ('melt', 0.3338084816932678),
    ('eat', 0.32058027386665344)]

From the results, we get burgers. I wonder if there is a way to take plurals into account? We also get some of the ingredients that I’d think would be included in some of Bob’s Burgers of the Day. Also eat. Yes, you eat burgers. They are delicious.

How about results for Bob himself?

model.wv.most_similar(positive='bob')
Results for bob
[   ('teddy', 0.3868436813354492),
    ('burgers', 0.3845985233783722),
    ('flipping', 0.37440577149391174),
    ('sal', 0.36525800824165344),
    ('bobby', 0.3411148488521576),
    ('burger', 0.335074782371521),
    ('linda', 0.32991185784339905),
    ('lin', 0.32708919048309326),
    ('restaurant', 0.3187229037284851),
    ('poplopovich', 0.30252140760421753)]

Bob’s best friend comes first. How nice! We also see Linda, his wife. Linda’s name for Bob is Bobby, so that shows up here as well.

One more trick. Let’s revisit king - man + woman = queen and see if bob - man + woman = linda.

result = model.wv.most_similar(
    positive=['woman', 'bob'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))
gayle: 0.3387

…We have some work to do. Gayle is Linda’s sister and Bob’s sister-in-law.

The results are promising, but because our dataset is small, they are far from perfect. More reason for Fox to keep Bob’s Burgers on the air!

More Posts by Bryan Joseph: