By Alexander Engelhardt, freelance Data Scientist
In this post I present a Python script that automatically generates suggestions for startup names. You feed it a text corpus with a certain theme, e.g. a Celtic text, and it then outputs similar sounding suggestions. An example call looks like this:
./generate.py -n 10 -t 1.2 -m models/gallic_500epochs.h5 wordlists/gallic.txt --suffix Software
I applied the script to “normal” texts in English, German, and French, and then experimented with corpora of Celtic songs, Pokemon names, and J.R.R. Tolkien’s Black Speech, the language of Mordor.
I’ve made a few longer lists of sampled proposals available here.
You can find the code, all the text corpora I’ve used, and some pre-computed models in my GitHub repo:
Recently, an associate and I set out to found a software company, but most of the name ideas we came up with were already taken. We wanted a name with a Celtic touch, and we needed a large pool of candidates to find one that was still available.
So I started to create a neural network that generates new, artificial words. You feed it a corpus of sample words in a style you like, for example, Celtic songs, and it picks up the peculiarities of the text (the "language") and generates new, similar-sounding words. The famous blog post by Andrej Karpathy provided me with the necessary knowledge and the confidence that this was a realistic idea.
I first built a corpus of raw text. For preprocessing, I stripped out all non-letter characters. Then, I split the text into words and kept only the unique ones. I figured this step was reasonable since I did not want the model to learn the most common words, but instead to get an understanding of the entire corpus' structure.
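The preprocessing in the repo may differ in detail, but a minimal sketch of the steps described above (strip non-letters, split into words, de-duplicate) could look like this; the lowercasing is my own assumption:

```python
import re

def build_wordlist(raw_text):
    """Strip non-letter characters, split into words, keep only unique words."""
    # Replace every run of non-letter characters with a single space
    letters_only = re.sub(r"[^a-zA-Z]+", " ", raw_text)
    # Split into words, lowercase them, and de-duplicate while keeping order
    seen = set()
    words = []
    for word in letters_only.lower().split():
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words

print(build_wordlist("The druid's song; the old, old song!"))
# → ['the', 'druid', 's', 'song', 'old']
```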
After this, most text corpora ended up as a list of 1000 to 2000 words.
The RNN architecture
Recurrent neural networks model language particularly well, which made them the appropriate choice for this word-generation task. However, finding the 'perfect' RNN architecture is still somewhat of a black art. Questions like how many layers or how many units to use have no definite answer, but are settled by experience and intuition.
To save training time, I wanted a model that was as complex as necessary, but as simple as possible. I settled on a character-level, two-layer LSTM with 50 units each, training it for 500 epochs. The words this model outputs already sound very good.
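Since the saved models are Keras `.h5` files, the described architecture can be sketched in Keras as follows. The sequence length and alphabet size are illustrative placeholders, not values taken from the repo:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

seq_length = 10     # characters of context fed to the model (assumed)
alphabet_size = 28  # e.g. a-z plus padding/end-of-word markers (assumed)

# Two LSTM layers with 50 units each, predicting the next character
model = Sequential([
    Input(shape=(seq_length, alphabet_size)),
    LSTM(50, return_sequences=True),
    LSTM(50),
    Dense(alphabet_size, activation="softmax"),  # distribution over next char
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```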
The RNN generates a new name character by character. It doesn't just output the next character, but a probability distribution over the next character. This lets us either pick the letter with the highest probability, or sample a letter from the distribution.
A nice touch I found is to vary the temperature of the sampling procedure. The temperature is a parameter that rescales the weights we sample from. The "standard" temperature of 1 leaves the weights unchanged. At lower temperatures, the sampling becomes less random, i.e. more conservative: the letter with the maximum weight is almost always chosen. At the other extreme, a high temperature pushes the weights toward a uniform distribution, i.e. total randomness. For sampling actual text, a temperature below 1 might be appropriate, but since I wanted new words, a higher temperature seemed better.
In the image above, imagine we want to sample one letter from A, B, …, J. The original weights output by your RNN might be the red bars. If you lower the temperature, the weights change to the yellow-ish bars (temperature = 0.1), and if you raise it, they change to the green bars (temperature = 3).
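The temperature trick described above is standard and can be sketched in a few lines of NumPy; the probability vector here is made up for illustration:

```python
import numpy as np

def adjust_weights(probs, temperature):
    """Reweight a next-character distribution by a sampling temperature."""
    logits = np.log(np.asarray(probs, dtype=np.float64)) / temperature
    weights = np.exp(logits - logits.max())  # subtract max for stability
    return weights / weights.sum()

def sample_char(probs, temperature=1.0):
    """Sample a character index from the temperature-adjusted distribution."""
    weights = adjust_weights(probs, temperature)
    return np.random.choice(len(weights), p=weights)

probs = [0.5, 0.3, 0.2]
print(adjust_weights(probs, 1.0))   # unchanged: [0.5 0.3 0.2]
print(adjust_weights(probs, 0.1))   # almost all mass on the argmax
print(adjust_weights(probs, 10.0))  # close to uniform
```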
Example calls and sample output
Call the script with the -h parameter to have it print an overview of all possible parameters. The following command trains an LSTM model on the wordlists/english.txt corpus for 500 epochs (-e 500), saves the model (-s) to models/english_500epochs.h5, samples 10 company names (-n 10) with a temperature of 1.2 (-t 1.2), and finally appends the word "Software" (--suffix) to the names (I found a long list of possible suffixes here). While training, I like to pass the -v argument to run in verbose mode. The model then prints some extra information as well as a few sample generated words every 10 epochs:
./generate.py -v -e 500 -n 10 -t 1.2 -s models/english_500epochs.h5 wordlists/english.txt --suffix Software
My call returned these suggestions:
Some other fine name suggestions I encountered that are too good not to share:
- Indeed (see, it works!)
- Unifart (I dare you!)
After you have stored the model (with the -s option), word generation is quicker if you load the saved model instead of re-computing it (i.e., using the -m argument instead of -s).
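This train-once, sample-many workflow maps directly to saving and reloading a Keras model; the tiny model below is just a stand-in for the trained LSTM:

```python
import os
import tempfile

from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Input, Dense

# After training, save the model once (this is what -s does in the script) ...
model = Sequential([Input(shape=(4,)), Dense(2, activation="softmax")])
path = os.path.join(tempfile.mkdtemp(), "demo_model.h5")
model.save(path)

# ... and later reload it instantly instead of retraining (what -m does)
reloaded = load_model(path)
```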
More exotic corpora
I collected one text each in German, English, and French, just to get some realistic-sounding words and gauge how well the model learns a corpus' structure.
However, most of my time was then spent on more fun corpora. Below, I’ll briefly describe them and also show some randomly sampled output for the generated words.
This corpus is composed of a Gaulish dictionary and selected song lyrics in the Gaulish language by Eluveitie:
If you provide a list of all Pokemon, you get Pokemon-themed names:
Tolkien’s Black Speech
Tolkien’s Black Speech, the language of Mordor, was a just-for-fun experiment:
This tool has already proved useful for us, coming up with a few very pleasant-sounding names. It might prove helpful to others, too.
Bio: Alexander Engelhardt recently earned an MSc and a PhD in statistics at the LMU in Munich, and then became a freelance data scientist specializing in machine learning with R. He recently caught the open source bug and started contributing to the R package 'mlr'. This post introduces his first personal side project.
Original. Reposted with permission.