How to Make a Wordcloud in Python Using a Text File

A wordcloud can be a good representation of what an individual talks and thinks about most. In this guide, we will use Python to create a wordcloud from a text corpus. A wordcloud is used to identify the most commonly used words in a text. In this tutorial, I will use a text corpus of Angela Merkel’s speeches from the past four years to make a wordcloud in Python.

A wordcloud lets you evaluate a text by highlighting its keywords, which in turn gives a good idea of what the text is about. The more frequently a word occurs, the larger it appears in the cloud.
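To make the link between frequency and size concrete, here is a minimal sketch that counts words with Python's standard collections.Counter. The sample sentence and the function name count_words are made up for illustration. The wordcloud library does its own counting internally, so this only shows the idea.

```python
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercase word to how often it occurs."""
    # Lowercase first so "Europa" and "europa" are counted together
    return Counter(text.lower().split())


# Hypothetical sample text, not taken from the speech corpus
sample = "Europa braucht Frieden und Europa braucht Zusammenhalt"
counts = count_words(sample)

# The most frequent words are the ones a wordcloud would draw largest
print(counts.most_common(2))
```

Words with a higher count are the ones that would end up larger in the image.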

In this guide, I am using Python to create an image file of the wordcloud. The development was done in PyCharm, and you can also find the code linked to this post.

We are going to use the following libraries:


import matplotlib.pyplot as plt
from wordcloud import WordCloud 
import stop_words as sw
import numpy as np
from PIL import Image

If you don’t have these libraries installed, you can install them with pip, for example:

pip install matplotlib

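Install the other libraries the same way. Note that the pip package names are wordcloud, stop-words (imported as stop_words), numpy, and Pillow (imported as PIL).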
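If you are not sure which of the libraries are already installed, a small check like the following may help. It is only a sketch: the helper name missing_modules is hypothetical, and the list uses the module names as they are imported in the code, which can differ from the pip package names.

```python
import importlib.util

# Module names as they are imported in the code (not the pip package names)
REQUIRED_MODULES = ["matplotlib", "wordcloud", "stop_words", "numpy", "PIL"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All libraries are available.")
```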

  • matplotlib: a popular library for plotting graphs and visualizing data.
  • wordcloud: as the name states, we use it to create the wordcloud.
  • stop_words: the wordcloud library ships its own stop word list, but I found this library a lot more efficient.
  • numpy: we use it to turn the mask image into an array, so that the text takes the shape of the image.
  • Image from PIL: we use it to load the image for a fancier wordcloud, which takes the shape of a background image and has different aesthetics to make it look better.

Now let’s move on to the guide on how to make a wordcloud in Python from text:

First, let’s talk about the main block, where we process the text. I have scraped the texts of Angela Merkel’s speeches from the last four years and stored them in a text (.txt) file.

Stop words are very common words, such as articles, prepositions, pronouns and conjunctions, that add little meaning to the text, so we remove them. The stop_words library has stop word lists for many languages. In this guide, we first remove the German stop words from our text file.
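The idea of stop word removal can be sketched on a single sentence. The stop word list below is a tiny hand-picked one for illustration, not the list from the stop_words library, and remove_stop_words is a hypothetical helper name.

```python
def remove_stop_words(text, stop_words):
    """Return `text` without the words found in `stop_words`."""
    stop_set = set(stop_words)  # set lookup is faster than scanning a list
    # Compare in lowercase so "Wir" matches the stop word "wir"
    kept = [word for word in text.split() if word.lower() not in stop_set]
    return " ".join(kept)


# Tiny hand-picked stop word list, for illustration only
sample_stop_words = ["und", "die", "der", "das", "ist", "wir"]
sample_text = "Wir schaffen das und die Zukunft ist offen"

print(remove_stop_words(sample_text, sample_stop_words))
# prints: schaffen Zukunft offen
```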

So we open the file, remove the stop words, and pass the filtered text as an argument to the wordcloud function.


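if __name__ == '__main__':
    # Load the German stop word list from the stop_words library
    stop_words = sw.get_stop_words('german')
    print(stop_words)
    # Read the whole corpus (assuming the file is saved as UTF-8)
    with open("AngelaMerkel.txt", encoding="utf-8") as file:
        words = file.read().split()
    # Keep only the words that are not stop words; compare in lowercase
    # so that capitalized words at the start of a sentence are caught too
    filtered_words = [w for w in words if w.lower() not in stop_words]
    # Write the filtered text once, overwriting the output of earlier runs
    with open('filteredtext.txt', 'w', encoding="utf-8") as out_file:
        out_file.write(" ".join(filtered_words))
    with open('filteredtext.txt', 'r', encoding="utf-8") as txt_file:
        filteredtext = txt_file.read()
    wordcloud(filteredtext, stop_words)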

Now that we have opened the file and removed the stop words, it’s time to generate the wordcloud. We start by setting the size and dimensions of the image.

The mask is the silhouette of an icon or image whose shape the text should take. For example, in the picture below, you can see that the wordcloud takes the shape of the masked image.

 


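def wordcloud(text, stopwordss):
    # Load the mask image as an array; white areas of the image stay empty
    my_mask = np.array(Image.open('angel.png'))
    # Note: when a mask is given, width and height are ignored and the
    # shape of the mask is used instead
    cloud = WordCloud(width=3000, height=2000, random_state=1, background_color='black',
                      colormap='rainbow', collocations=False,
                      stopwords=stopwordss, mask=my_mask).generate(text)
    # Save the wordcloud as an image file
    cloud.to_file("wordcloud1.png")
    # Display image
    plt.figure(figsize=(40, 30))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()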
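The mask used in the function above is loaded from an image file, but the same kind of array can be built by hand, which makes the convention easier to see. In the wordcloud library, white entries (255) of the mask are treated as masked out and all other entries are free to draw on. The sketch below builds a circular mask with numpy. The function name circle_mask and the sizes are assumptions chosen for illustration.

```python
import numpy as np


def circle_mask(size=300, radius=130):
    """Return a size x size uint8 array: 0 inside a centered circle, 255 outside."""
    center = size // 2
    y, x = np.ogrid[:size, :size]
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    # 255 (white) marks the pixels where no words may be drawn
    return (255 * outside).astype(np.uint8)


mask = circle_mask()
print(mask.shape, mask[150, 150], mask[0, 0])
```

An array like this could be passed as mask=... to WordCloud in place of the array loaded from angel.png, giving a round wordcloud.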

You can change the look of the wordcloud through the arguments of WordCloud(), for example the background color and the colormap. A reference of the available colormaps can be found on the matplotlib website. We also pass our stop words to the stopwords parameter of WordCloud to further improve the results. Once the parameters are set, we call generate() and pass the contents of the text corpus as text.

If you want to make a wordcloud yourself, you can fork the source code from here on GitHub.

That is all from my side. Making a wordcloud is pretty easy; the complicated part is scraping the data. You can find the source code on my GitHub repository as well. If you have questions about the code above or about Python, feel free to reach out to us. If you have made a wordcloud using this guide and would like to share it with us, we would love to see your work.
