Python applications often need to scrape and process information from Wikipedia articles. This article explains how to use the Python Wikipedia module, one of the most commonly used Python modules for scraping Wikipedia articles.

Summary

  1. Installing the Library
  2. Searching Page Names
  3. Scraping Page Suggestions
  4. Scraping Page Summary
  5. Scraping Elements within a Wikipedia Page
  6. Changing Page Language

Python Wikipedia Module to Scrape Wikipedia

Installing the Library

To install the Wikipedia library, execute the following command on your command terminal:

pip install wikipedia
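
Before going further, it can help to confirm that the package is importable from the interpreter you are actually running. The following is a minimal sketch that uses only the standard library; the helper name is_installed is our own and is not part of the Wikipedia module.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("wikipedia"):
    print("The wikipedia package is available.")
else:
    print("The wikipedia package is missing; run: pip install wikipedia")
```

If the check reports the package as missing even after installation, pip may have installed it for a different Python interpreter.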

Searching Page Names

Based on your search query, the search() method of the Wikipedia module can return several page titles. For instance, the following script imports the module and searches for Wikipedia pages on “eiffel tower”.

import wikipedia

print(wikipedia.search("eiffel tower"))

The output below lists the titles of the Wikipedia pages that match the term “eiffel tower”.
Output:

['Eiffel Tower', 'List of the 72 names on the Eiffel Tower', 'Under the Eiffel Tower', 'Gustave Eiffel', 'Eiffel Tower (disambiguation)', 'Eiffel Tower replicas and derivatives', 'Eiffel Tower (Paris, Tennessee)', 'The Man on the Eiffel Tower', 'Eiffel Tower (Paris, Texas)', 'Exposition Universelle (1889)']

You can limit the number of results by passing a value for the results parameter of the search() method. For instance, the following script returns only five page titles that match the term “eiffel tower”.

print(wikipedia.search("eiffel tower", results=5))

Output:

['Eiffel Tower', 'Gustave Eiffel', 'Under the Eiffel Tower', 'List of the 72 names on the Eiffel Tower', 'Eiffel Tower (disambiguation)']
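
Search results can include disambiguation pages, which are rarely what you want to scrape. The sketch below filters them out of a list of titles. It assumes that such pages carry the “(disambiguation)” suffix, which is common but not guaranteed; drop_disambiguation is a hypothetical helper name, not a function of the Wikipedia module.

```python
def drop_disambiguation(titles):
    """Return the titles that do not end with '(disambiguation)'."""
    return [t for t in titles if not t.lower().endswith("(disambiguation)")]


# Titles copied from the five-result output shown above.
results = [
    "Eiffel Tower",
    "Gustave Eiffel",
    "Under the Eiffel Tower",
    "List of the 72 names on the Eiffel Tower",
    "Eiffel Tower (disambiguation)",
]

print(drop_disambiguation(results))
```

In a real script you would pass the list returned by wikipedia.search() instead of the hard-coded list.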

Scraping Page Suggestions

If you do not know the exact name of a page, you can use the suggest() method, which suggests the most relevant Wikipedia page for your search query. For instance, in the following script, the term “eiffel towr” is passed to the suggest() method. Although “towr” is misspelled, the suggest() method is able to suggest the page “eiffel tower”.

print(wikipedia.suggest("eiffel towr"))

Output:

eiffel tower
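
According to the library's documentation, suggest() returns None when it has no suggestion, so it is safer not to use its result blindly. The sketch below falls back to the original query in that case. The names resolve_query and fake_suggest are our own; fake_suggest is only a stand-in so the example runs without network access, and in practice you would pass wikipedia.suggest.

```python
def resolve_query(query, suggest_fn):
    """Return the suggested query, or the original query if there is none."""
    suggestion = suggest_fn(query)
    return suggestion if suggestion else query


def fake_suggest(query):
    """Stand-in for wikipedia.suggest so the sketch runs offline."""
    known = {"eiffel towr": "eiffel tower"}
    return known.get(query)  # None when there is no suggestion


print(resolve_query("eiffel towr", fake_suggest))   # eiffel tower
print(resolve_query("eiffel tower", fake_suggest))  # eiffel tower
```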

Scraping Page Summary

To scrape the summary of a Wikipedia page, pass the name of the page to the summary() method of the Wikipedia module. The following script returns a summary of the Wikipedia page on “eiffel tower”.

print(wikipedia.summary("eiffel tower"))

Output:

[Image: the printed summary of the “Eiffel Tower” page]

You can also limit the number of sentences in a page’s summary by passing an integer value for the sentences parameter of the summary() method. For example, the following script returns the first five sentences of the summary of the Wikipedia article on “eiffel tower”.

print(wikipedia.summary("eiffel tower", sentences=5))

Output:

[Image: the first five sentences of the “Eiffel Tower” summary]
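
When you need short summaries for several pages, a small helper keeps the loop in one place. The following is a sketch rather than a definitive implementation: summaries_for and fake_summary are hypothetical names, and fake_summary merely stands in for wikipedia.summary so the example runs offline.

```python
def summaries_for(titles, summary_fn, sentences=2):
    """Map each title to the text returned by `summary_fn`."""
    return {title: summary_fn(title, sentences=sentences) for title in titles}


def fake_summary(title, sentences=2):
    """Stand-in for wikipedia.summary so the sketch runs offline."""
    return f"<{sentences}-sentence summary of {title}>"


result = summaries_for(["Eiffel Tower", "Gustave Eiffel"], fake_summary)
for title, text in result.items():
    print(title, "->", text)
```

With the real library installed, you would call summaries_for(titles, wikipedia.summary) instead.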

Scraping Elements within a Wikipedia Page

With the Wikipedia module, you can scrape different elements of a Wikipedia page, e.g., the page title, the page URL, the links and references in the page, and the images on the page.

To access page elements, you first have to create an object of the WikipediaPage class for the page you are interested in. The page() method returns such an object. The following script creates a WikipediaPage object for “eiffel tower”.

et_page = wikipedia.page("eiffel tower")

The WikipediaPage class object can now be used to access various page elements.
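
The page() call can fail: the library raises DisambiguationError when a title is ambiguous and PageError when no page matches. The sketch below wraps the call so that a script can react instead of crashing. The helper name load_page and its return convention are our own assumptions, not part of the Wikipedia module.

```python
def load_page(title):
    """Return (page, options).

    page is a WikipediaPage or None; options lists alternative titles
    when the requested title is ambiguous.
    """
    # Imported inside the function so the helper can be defined even
    # where the third-party package is not installed.
    import wikipedia

    try:
        return wikipedia.page(title), []
    except wikipedia.exceptions.DisambiguationError as err:
        # err.options holds the candidate titles reported by the library.
        return None, list(err.options)
    except wikipedia.exceptions.PageError:
        # No page matched the title.
        return None, []


# Usage (requires network access):
# page, options = load_page("eiffel tower")
# if page is not None:
#     print(page.title)
# else:
#     print("Could not load the page; alternatives:", options)
```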

Scraping Page Title

To access the page title, use the title attribute of the WikipediaPage class. The following script prints the page title for the Wikipedia page on “eiffel tower”.

print(et_page.title)

Output:

Eiffel Tower

Scraping Page Content

To scrape all the contents of a Wikipedia page, you can use the content attribute of the WikipediaPage class. The following script prints the page content for the Wikipedia page on “eiffel tower”.

print(et_page.content)

Output:

[Image: the printed content of the “Eiffel Tower” page]

Scraping Page URL

To scrape a page URL, use the url attribute of the WikipediaPage class. Here is an example:

print(et_page.url)

Output:

https://en.wikipedia.org/wiki/Eiffel_Tower

Scraping Page References

A Wikipedia page contains several references. You can get all the references in the form of a list using the references attribute. The following script returns all the references from the Wikipedia page on “eiffel tower”. The script iterates through the list of references using a for loop and prints each link on the console.

refs = et_page.references

for i in refs:
    print(i)

The output shows some of the references from Wikipedia’s page on “eiffel tower”.
Output:

[Image: a list of reference URLs from the “Eiffel Tower” page]
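
Because the references are plain URL strings, you can process them with the standard library. As an illustration, the sketch below counts how many references point at each host. The function name count_domains and the example.com URLs are placeholders of our own; in practice you would pass et_page.references.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how many reference URLs point at each host name."""
    return Counter(urlparse(u).netloc for u in urls)


# Placeholder URLs for illustration only.
sample_refs = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.org/c",
]

for domain, count in count_domains(sample_refs).most_common():
    print(domain, count)
```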

Scraping Page Links

In addition to scraping complete references, you can also scrape the titles of the Wikipedia pages that are linked from a page. To do so, you can use the links attribute as shown in the following script.

links = et_page.links

for link in links:
    print(link)

Output:

[Image: a list of page titles linked from the “Eiffel Tower” page]

Scraping Page Images

A Wikipedia page may contain one or more images. To retrieve the links to all the images on a Wikipedia page, you can use the images attribute. The following script returns all the image links from Wikipedia’s page on “eiffel tower”. The script iterates through the list of image links using a for loop and prints each link on the console.

images = et_page.images

for i in images:
    print(i)

Output:

[Image: a list of image URLs from the “Eiffel Tower” page]

Since the images attribute returns a list, you can access an individual image by its index. For instance, the following script returns the URL of the third image on Wikipedia’s page on the Eiffel Tower.

print(et_page.images[2])

Output:

https://upload.wikimedia.org/wikipedia/commons/7/7f/Caricature_Gustave_Eiffel.png

If you paste the above link into a browser, you should see the image it points to, which its file name suggests is a caricature of Gustave Eiffel.
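
The images list may mix photographs with other file types such as icons, so you may want to keep only certain formats. The following sketch filters image URLs by file extension. The helper filter_by_extension is our own, and the second sample URL is a hypothetical placeholder; in practice you would pass et_page.images.

```python
from urllib.parse import urlparse


def filter_by_extension(urls, extensions=(".jpg", ".jpeg", ".png")):
    """Keep URLs whose path ends with one of the given extensions."""
    wanted = tuple(ext.lower() for ext in extensions)
    return [u for u in urls if urlparse(u).path.lower().endswith(wanted)]


sample_images = [
    "https://upload.wikimedia.org/wikipedia/commons/7/7f/Caricature_Gustave_Eiffel.png",
    "https://example.org/placeholder-icon.svg",  # hypothetical entry
]

print(filter_by_extension(sample_images))
```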

Changing Page Language

You can also change the language in which the Wikipedia module scrapes a Wikipedia page. At the time of writing, the Wikipedia module supports 451 languages. To print the codes and names of all the languages, you can call the languages() method of the Wikipedia module as shown below.

print(wikipedia.languages())

Output:

[Image: the printed dictionary of language codes and names]

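
Since languages() returns a dictionary that maps language codes to language names, you can validate a code before using it. The sketch below falls back to a default when the requested code is unknown. The helper choose_language is a hypothetical name, and the two-entry dictionary is only a stand-in for the much larger one returned by wikipedia.languages().

```python
def choose_language(code, supported, default="en"):
    """Return `code` if it is a key of `supported`, otherwise `default`."""
    return code if code in supported else default


# Tiny stand-in for the dictionary returned by wikipedia.languages().
supported = {"en": "English", "es": "español"}

print(choose_language("es", supported))  # es
print(choose_language("xx", supported))  # en
```
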
To change the language of the results, pass the language code to the set_lang() method of the Wikipedia module. The following script changes the language for the Wikipedia module to Spanish and then prints the summary of the Wikipedia page on “torre eiffel”. The output shows that the summary is printed in Spanish.

wikipedia.set_lang("es")
print(wikipedia.summary("torre eiffel", sentences=2))

Output:

[Image: the two-sentence summary of “torre eiffel”, printed in Spanish]