How can I get a Wikipedia article's text using Python 3 with Beautiful Soup?

There is a much easier way to get information from Wikipedia: the Wikipedia API.

There is a Python wrapper (imported as wikipediaapi) that lets you do it in a few lines, with no HTML parsing:

import wikipediaapi

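# Recent releases of the wrapper require a user agent; the value below is a placeholder
wiki_wiki = wikipediaapi.Wikipedia(user_agent='MyProjectName (merlin@example.com)', language='en')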

page = wiki_wiki.page('Mathematics')
print(page.summary)

Prints:

Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change...(omitted intentionally)

And, in general, try to avoid screen-scraping if there's a direct API available.
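To illustrate what "a direct API" can look like without any wrapper, here is a minimal sketch that queries the MediaWiki TextExtracts endpoint using only the standard library. The helper names (build_extract_url, parse_extract, fetch_article_text) and the user agent string are made up for this example, and it assumes the English Wikipedia endpoint with formatversion=2 responses.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a TextExtracts query URL for a single article title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,      # ask for plain text instead of HTML
        "titles": title,
        "format": "json",
        "formatversion": 2,    # 'pages' is returned as a list
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_extract(payload):
    """Return the article text from a decoded API response, or None."""
    pages = payload.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing"):
        return None
    return pages[0].get("extract")


def fetch_article_text(title, user_agent="example-script/0.1 (you@example.com)"):
    """Download and return the plain text of an article (placeholder user agent)."""
    request = urllib.request.Request(
        build_extract_url(title), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_extract(json.load(response))
```

Calling fetch_article_text("Mathematics") should return the article as one plain-text string, or None if the title does not exist.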

Select the <p> tags. There are 52 of them on this page. I'm not sure whether you want the whole article, but you can iterate through the tags and store the text however you like. Here I just print each one to show the output.

import bs4
import requests


response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

# requests.get() never returns None, so check the HTTP status instead
if response.ok:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    # article title, taken from the page's main heading
    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join(para.text for para in paragraphs[0:5])
    print(intro)
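The fixed slice of five paragraphs depends on how this particular page happens to be laid out. As a hedged alternative, the sketch below collects paragraphs until the first section heading. It assumes the article body sits in div.mw-parser-output and that the first h2 marks the end of the introduction; both are observations about Wikipedia's current markup, not guarantees. The function name intro_paragraphs is made up for this example.

```python
import bs4


def intro_paragraphs(html_text):
    """Return the text of each paragraph that precedes the first <h2>."""
    soup = bs4.BeautifulSoup(html_text, "html.parser")
    # Assumption: the article body lives in div.mw-parser-output;
    # fall back to the whole document if that container is absent.
    body = soup.select_one("div.mw-parser-output") or soup
    intro = []
    for element in body.find_all(["p", "h2"]):
        if element.name == "h2":
            break  # first section heading: the introduction is over
        text = element.get_text().strip()
        if text:  # skip empty placeholder paragraphs
            intro.append(text)
    return intro
```

With the response from the code above, '\n'.join(intro_paragraphs(response.text)) would give the introduction as a single string.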

Use the wikipedia library:

import wikipedia

# Alternatives: wikipedia.summary("Mathematics") returns just the summary,
# and wikipedia.search("Mathematics") lists matching page titles.
print(wikipedia.page("Mathematics").content)  # full plain-text article
