You often need to scrape and process information from Wikipedia articles in your Python applications. This article explains how to use the Python Wikipedia module, one of the most commonly used Python modules for scraping Wikipedia articles.
Summary
- Installing the Library
- Searching Page Names
- Scraping Page Suggestions
- Scraping Page Summary
- Scraping Elements within a Wikipedia Page
- Changing Page Language
Installing the Library
To install the Wikipedia library, execute the following command on your command terminal:
pip install wikipedia
Searching Page Names
The search() method of the Wikipedia module returns a list of page titles that match your search query. For instance, the following script imports the module and searches for Wikipedia pages related to “eiffel tower”:
import wikipedia
print(wikipedia.search("eiffel tower"))
The output below lists the Wikipedia pages that match the term “eiffel tower”.
Output:
['Eiffel Tower', 'List of the 72 names on the Eiffel Tower', 'Under the Eiffel Tower', 'Gustave Eiffel', 'Eiffel Tower (disambiguation)', 'Eiffel Tower replicas and derivatives', 'Eiffel Tower (Paris, Tennessee)', 'The Man on the Eiffel Tower', 'Eiffel Tower (Paris, Texas)', 'Exposition Universelle (1889)']
You can limit the number of results by passing a value for the results parameter of the search() method. For instance, the following script returns only five pages that match the term “eiffel tower”.
print(wikipedia.search("eiffel tower", results = 5))
Output:
['Eiffel Tower', 'Gustave Eiffel', 'Under the Eiffel Tower', 'List of the 72 names on the Eiffel Tower', 'Eiffel Tower (disambiguation)']
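When a program needs a single page title rather than a list, one option is to take the top search hit. The helper below is a minimal sketch of that idea. The name first_match is hypothetical and not part of the Wikipedia module, and the search parameter exists only so that the lookup can be replaced with a stand-in.

```python
def first_match(query, limit=5, search=None):
    """Return the top search hit for `query`, or None if nothing matched.

    Illustrative helper only; it is not part of the wikipedia module.
    """
    if search is None:
        # Imported lazily so the helper can be exercised with a stand-in.
        import wikipedia
        search = wikipedia.search
    titles = search(query, results=limit)
    return titles[0] if titles else None
```

Calling first_match("eiffel tower") performs a live request and should return the first title in the list that search() produces.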
Scraping Page Suggestions
If you do not know the exact name of a page, you can use the suggest() method, which suggests the most relevant Wikipedia page title for your search query. For instance, in the following script, the term “eiffel towr” is passed to the suggest() method. Although the word “towr” is misspelled, the suggest() method is able to suggest the page “eiffel tower”.
print(wikipedia.suggest("eiffel towr"))
Output:
eiffel tower
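A query that is already spelled correctly may produce no suggestion at all. The sketch below assumes that suggest() returns None in that case and falls back to the original query. The name resolve_query is hypothetical, and the suggest parameter is there only to allow a stand-in.

```python
def resolve_query(query, suggest=None):
    """Return the suggested spelling for `query`, or `query` itself.

    Illustrative helper; assumes suggest() returns None when it has
    nothing to propose.
    """
    if suggest is None:
        # Imported lazily so the helper can be exercised with a stand-in.
        import wikipedia
        suggest = wikipedia.suggest
    suggestion = suggest(query)
    return suggestion if suggestion else query
```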
Scraping Page Summary
To scrape the summary of a Wikipedia page, pass the name of the page to the summary() method of the Wikipedia module. The following script returns a summary of the Wikipedia page on “eiffel tower”.
print(wikipedia.summary("eiffel tower"))
Output:
You can also limit the number of sentences in a page’s summary by passing an integer value for the sentences parameter of the summary() method. For example, the following script returns the first five sentences of the summary of the Wikipedia article on “eiffel tower”.
print(wikipedia.summary("eiffel tower", sentences = 5))
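A title passed to summary() may be ambiguous or may not match any page. The sketch below shows one way to handle both cases, assuming the module's DisambiguationError (with its options list) and PageError exceptions. The name safe_summary is hypothetical, and the api parameter exists only so that a stand-in can be supplied in place of the real module.

```python
def safe_summary(title, sentences=2, api=None):
    """Return a short summary, or a readable message when the lookup fails.

    Illustrative helper; `api` defaults to the wikipedia module.
    """
    if api is None:
        import wikipedia as api
    try:
        return api.summary(title, sentences=sentences)
    except api.exceptions.DisambiguationError as err:
        # err.options lists candidate page titles for an ambiguous query.
        return "Ambiguous title; candidates include: " + ", ".join(err.options[:3])
    except api.exceptions.PageError:
        return "No page found for: " + title
```

Returning a message keeps the example short; in a larger application you might prefer to re-raise the exception or return None instead.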
Scraping Elements within a Wikipedia Page
With the Wikipedia module, you can scrape different elements of a Wikipedia page, such as the page title, the page URL, the links and references in a page, and the images on a page.
To access page elements, you first have to create an object of the WikipediaPage class for a specific Wikipedia page. The page() method returns such an object. The following script creates a WikipediaPage object for “eiffel tower”.
et_page = wikipedia.page("eiffel tower")
The WikipediaPage class object can now be used to access various page elements.
Scraping Page Title
To access the page title, use the title attribute of the WikipediaPage class. The following script prints the page title for the Wikipedia page on “eiffel tower”.
print(et_page.title)
Output:
Eiffel Tower
Scraping Page Content
To scrape all the contents of a Wikipedia page, you can use the content attribute of the WikipediaPage class. The following script prints the page content for the Wikipedia page on “eiffel tower”.
print(et_page.content)
Output:
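The text returned by content is one long string. As a rough sketch, the function below splits that string into sections, assuming that section headings appear on their own lines wrapped in equals signs (for example "== History =="). The name split_sections and the "Introduction" label for the leading text are hypothetical choices.

```python
import re

# Matches heading lines such as "== History ==" or "=== Design ===".
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(content):
    """Split page text into a list of (heading, body) pairs.

    Text before the first heading is returned under "Introduction".
    """
    sections = []
    heading, start = "Introduction", 0
    for match in HEADING.finditer(content):
        # Close the previous section at the start of this heading line.
        sections.append((heading, content[start:match.start()].strip()))
        heading, start = match.group(2), match.end()
    sections.append((heading, content[start:].strip()))
    return sections
```

With the page object from earlier, split_sections(et_page.content) would give one pair per section, which is easier to search or store than the full text.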
Scraping Page URL
To scrape a page URL, use the url attribute of the WikipediaPage class. Here is an example:
print(et_page.url)
Output:
https://en.wikipedia.org/wiki/Eiffel_Tower
Scraping Page References
A Wikipedia page contains several references. You can get all the references in the form of a list using the references attribute. The following script returns all the references from the Wikipedia page on “eiffel tower”. The script iterates through the list of references using a for loop and prints each link on the console.
refs = et_page.references
for i in refs:
    print(i)
The output shows some of the references from Wikipedia’s page on “eiffel tower”.
Output:
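Since the references are plain URL strings, they can be processed with the standard library. The sketch below counts how many references point at each host name. The name count_domains is hypothetical and only illustrates one possible way to post-process the list.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how many reference URLs point at each host name."""
    hosts = (urlparse(url).netloc.lower() for url in urls)
    # Skip entries for which no host name could be parsed.
    return Counter(host for host in hosts if host)
```

For example, count_domains(et_page.references).most_common(5) would list the five hosts cited most often on the page.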
Scraping Page Links
In addition to scraping complete references, you can also scrape the names of the pages that are linked from a Wikipedia page. To do so, you can use the links attribute as shown in the following script.
links = et_page.links
for i in links:
    print(i)
Output:
Scraping Page Images
A Wikipedia page may contain one or more images. To retrieve all the images from a Wikipedia page, you can use the images attribute. The following script returns all the image links from Wikipedia’s page on “eiffel tower”. The script iterates through the list of image links using a for loop and prints each link on the console.
images = et_page.images
for i in images:
    print(i)
Output:
Since the images attribute returns a list, you can access an individual image via its index. For instance, the following script returns the URL of the third image on Wikipedia’s page on the Eiffel Tower.
print(et_page.images[2])
Output:
https://upload.wikimedia.org/wikipedia/commons/7/7f/Caricature_Gustave_Eiffel.png

If you paste the above link into a browser, you should see the corresponding image.
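The list of image links may include icons and other file types you do not need. As a small sketch, the function below keeps only the URLs whose path ends with a chosen extension. The name filter_images and the default extension list are hypothetical choices.

```python
from urllib.parse import urlparse


def filter_images(urls, extensions=(".jpg", ".jpeg", ".png")):
    """Keep only the URLs whose path ends with one of `extensions`."""
    wanted = tuple(ext.lower() for ext in extensions)
    # Compare against the URL path so that query strings are ignored.
    return [url for url in urls if urlparse(url).path.lower().endswith(wanted)]
```

For example, filter_images(et_page.images) would drop SVG icons while keeping photographs and PNG drawings.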
Changing Page Language
You can also change the language in which the Wikipedia module scrapes a Wikipedia page. At the time of writing, the Wikipedia module supports 451 languages. To print the codes and names of all the languages, you can call the languages() method of the Wikipedia module as shown below.
print(wikipedia.languages())
Output:
To change the language for results, you need to pass the language code to the set_lang() method of the Wikipedia module. The following script changes the language for the Wikipedia module to Spanish and then prints the summary of the Wikipedia page on “torre eiffel”. The output shows that the summary is printed in Spanish.
wikipedia.set_lang("es")
print(wikipedia.summary("torre eiffel", sentences = 2))
Output:
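Because set_lang() changes a setting for the whole module, later calls keep using the new language until it is changed again. The sketch below checks the code against languages(), fetches a summary, and then switches back. The name summary_in is hypothetical, the api parameter exists only to allow a stand-in, and resetting to "en" assumes English was the language in use beforehand.

```python
def summary_in(lang, title, sentences=2, api=None):
    """Fetch a summary in `lang`, then switch the module back to English.

    Illustrative helper; `api` defaults to the wikipedia module.
    """
    if api is None:
        import wikipedia as api
    # languages() returns a dictionary keyed by language code.
    if lang not in api.languages():
        raise ValueError("Unsupported language code: " + lang)
    api.set_lang(lang)
    try:
        return api.summary(title, sentences=sentences)
    finally:
        # Assumption: English was the active language before this call.
        api.set_lang("en")
```

With the real module, summary_in("es", "torre eiffel") would return the Spanish summary and leave later calls in English.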