How to obtain a list of titles of all Wikipedia articles
The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
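As a rough illustration of the API route, here is a minimal Python 3 sketch of paging through list=allpages. It is not taken from the answer: the helper names `fetch_json` and `iter_titles`, the User-Agent string, and the use of the standard library instead of a MediaWiki client are all assumptions made for the example. Each reply's "continue" object is merged back into the request until the API stops returning one.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the MediaWiki API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent value is a placeholder chosen for this sketch.
    request = urllib.request.Request(
        url, headers={"User-Agent": "allpages-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_titles(fetch=fetch_json, namespace=0, filterredir="all"):
    """Yield (pageid, title) pairs for every page in the given namespace.

    `fetch` is any callable that takes the request parameters and returns
    the decoded reply; it is injectable so the loop can be tried offline.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "aplimit": "max",
        "apnamespace": str(namespace),
        "apfilterredir": filterredir,  # "all", "redirects" or "nonredirects"
    }
    while True:
        reply = fetch(params)
        for page in reply["query"]["allpages"]:
            yield page["pageid"], page["title"]
        if "continue" not in reply:
            break  # last batch: the API sends no continuation values
        # Merge the continuation values into the next request.
        params = dict(params, **reply["continue"])
```

Calling `for pageid, title in iter_titles(): ...` would issue the roughly 9000 requests mentioned above, so it is worth trying the loop on a small wiki or with a stubbed `fetch` first.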
According to the current statistics, the number of articles is now around 5.8M. To get the list of pages I used the AllPages API. However, I get around 14.5M pages, which is about 2.5 times what I was expecting. I restricted the query to namespace 0. The following is the sample code I am using:
# get the list of all wikipedia pages (articles) -- English
import sys
from simplemediawiki import MediaWiki
listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")
wiki = MediaWiki('https://en.wikipedia.org/w/api.php')
continueParam = ''
requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'
requestObj['apnamespace'] = '0'
pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']
for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)
numQueries = 1
while len(pagelist['query']['allpages']) > 0:
    requestObj['apcontinue'] = pagelist["continue"]["apcontinue"]
    pagelist = wiki.call(requestObj)
    pagesInQuery = pagelist['query']['allpages']
    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)
        # print writestr
    numQueries += 1
    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries
print numQueries
listOfPagesFile.close()
The script fires around 28,900 queries, which yields approximately 14.5M page names.
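The request counts can be sanity-checked with a little arithmetic. The sketch below assumes the 500-titles-per-request limit quoted in the answer; the function names are made up for the example.

```python
import math

BATCH_SIZE = 500  # titles per request with aplimit=max, as quoted in the answer


def requests_needed(total_pages, batch_size=BATCH_SIZE):
    """Number of requests needed to list total_pages titles."""
    return math.ceil(total_pages / batch_size)


def pages_returned(num_requests, batch_size=BATCH_SIZE):
    """Upper bound on titles returned by num_requests full batches."""
    return num_requests * batch_size


print(requests_needed(4_500_000))  # 9000
print(requests_needed(5_800_000))  # 11600
print(pages_returned(28_900))      # 14450000
```

So 28,900 full batches are consistent with the 14.5M titles observed, while 5.8M titles would have taken only about 11,600 requests.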
I also tried the all-titles dump linked in the answer above. It gives around 14.5M pages as well.
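For reference, counting the titles in a downloaded dump can be sketched as follows. This assumes the file is gzipped text with one title per line and a single column-header line at the top; the function name `count_titles` and the `has_header` flag are hypothetical, so check the first few lines of the actual file before relying on the count.

```python
import gzip


def count_titles(path, has_header=True):
    """Count non-empty title lines in a gzipped, one-title-per-line file."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        if has_header:
            next(dump, None)  # skip the assumed column-header line
        for line in dump:
            if line.strip():
                count += 1
    return count
```

Reading the file line by line keeps memory use flat, which matters for a listing with millions of entries.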
I suspected that the overcount was caused by redirects, so I added the 'nonredirects' option to the request object:
requestObj['apfilterredir'] = 'nonredirects'
After doing that I get only 112,340 pages, which is far too few compared to 5.8M.
With the above code I was expecting roughly 5.8M pages, but that is not what I get.
Is there any other option I should try in order to get the actual (~5.8M) set of page names?