Web scraping Google News with Python
You can use the excellent requests library:
import requests

# Note: the tbs date range (cd_min/cd_max) in this URL is hard-coded to 3/1/13 - 3/2/13.
URL = 'https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q={query}&as_occt=any&as_drrb=b&as_mindate={month}%2F{from_day}%2F{year}&as_maxdate={month}%2F{to_day}%2F{year}&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'

def run(**params):
    # Fill the placeholders in the URL and fetch the results page.
    response = requests.get(URL.format(**params))
    print(response.content, response.status_code)

run(query="Egypt", month=3, from_day=2, to_day=2, year=13)
You should get status_code=200.
Also, take a look at the Scrapy project. Few tools make web scraping simpler.
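If hand-encoding the query string gets error-prone, here is a minimal sketch that builds it with urllib.parse.urlencode instead. The parameter names (hl, gl, tbm, as_q, tbs) are taken from the URL above; the function name build_news_url is just something I made up for illustration, and I'm assuming Google still honours these parameters.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"


def build_news_url(query, month, from_day, to_day, year):
    """Return a search URL limited to news results in a date range.

    build_news_url is a hypothetical helper; the parameter names come
    from the hand-written URL in the answer above.
    """
    params = {
        "hl": "en",
        "gl": "us",
        "tbm": "nws",  # news results only
        "as_q": query,
        # Custom date range: cdr:1 enables it, cd_min/cd_max are M/D/Y.
        "tbs": f"cdr:1,cd_min:{month}/{from_day}/{year},"
               f"cd_max:{month}/{to_day}/{year}",
    }
    # urlencode percent-encodes ':', ',' and '/' for us.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_news_url("Egypt", month=3, from_day=2, to_day=2, year=13)
print(url)
```

You can pass the result to requests.get(url), or skip the helper and hand the same dict to requests.get(BASE_URL, params=params), which does the encoding itself.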
You can use the google-search-results package to extract data from Google News. It's a paid API with a free trial.
Check out a demo on Repl.it.
from serpapi import GoogleSearch
import os
month = 4
from_day = 2
to_day = 3
year = 2020
params = {
    "engine": "google",
    "q": "Trump",
    "google_domain": "google.com",
    "tbm": "nws",
    "tbs": f"cdr:1,cd_min:{month}/{from_day}/{year},cd_max:{month}/{to_day}/{year}",
    "api_key": os.getenv("API_KEY"),
}
client = GoogleSearch(params)
data = client.get_dict()
print(f"Raw HTML: {data['search_metadata']['raw_html_file']}")
print(f"JSON endpoint: {data['search_metadata']['json_endpoint']}")
print()
print("News results")
for result in data['news_results']:
    print(f"""
Title: {result['title']}
Snippet: {result['snippet']}
Date: {result['date']}
""")
Part of JSON response
{
  "news_results": [
    {
      "position": 1,
      "title": "Trump Promotes Oil Deal That May Not Exist",
      "link": "https://www.nytimes.com/2020/04/02/us/politics/trump-russia-saudi-arabia-oil.html",
      "source": "The New York Times",
      "date": "15 hours ago",
      "snippet": "WASHINGTON — When oil prices crashed in early March after a dispute between \nRussia and Saudi Arabia, President Trump put a positive spin on the news.",
      "thumbnail": ""
    },
    {
      "position": 2,
      "title": "Trump’s Oil Summit",
      "link": "https://www.wsj.com/articles/trumps-oil-summit-11585870063",
      "source": "Wall Street Journal",
      "date": "Opinion · 16 hours ago",
      "snippet": "Trump's Oil Summit. Tariffs and quotas won't solve a price shock caused by \na pandemic and a Saudi Arabia-Russia feud.",
      "thumbnail": ""
    }
  ]
}
Output
News results
Title: Trump Promotes Oil Deal That May Not Exist
Snippet: WASHINGTON — When oil prices crashed in early March after a dispute between
Russia and Saudi Arabia, President Trump put a positive spin on the news.
Date: 15 hours ago
Title: Trump’s Oil Summit
Snippet: Trump's Oil Summit. Tariffs and quotas won't solve a price shock caused by
a pandemic and a Saudi Arabia-Russia feud.
Date: Opinion · 16 hours ago
Title: OPEC and allies reportedly set for video meeting as analysts pour
skepticism on Trump's intervention
Snippet: “Donald Trump's tweet … It's nonsense, really,” Patrick Armstrong, chief
investment officer at Plurimi Investment Managers, told CNBC's “Squawk Box
Europe” on ...
Date: 5 hours ago
Title: Trump again tests negative for coronavirus
Snippet: President Donald Trump on Thursday again tested negative for the
coronavirus after being tested by the White House physician, according to
two White House ...
Date: 17 hours ago
Title: Trump passes the buck as deadly ventilator shortage looms
Snippet: (CNN) President Donald Trump is pinning the blame on states for a shortage
of ventilators that governors warn could effectively condemn coronavirus
patients to ...
Date: 10 hours ago
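If some results come back without a snippet or date, indexing with result['snippet'] would raise a KeyError. Here is a rough sketch of a more defensive way to walk the same "news_results" structure. The helper name summarize_news is my own, the second sample entry is made up to show the missing-field case, and I'm only assuming the response keeps the shape shown in the JSON above.

```python
def summarize_news(data):
    """Collect title, snippet and date from each entry in data["news_results"].

    summarize_news is a hypothetical helper; missing fields become "".
    """
    rows = []
    for result in data.get("news_results", []):
        rows.append({
            "title": result.get("title", ""),
            # Snippets contain embedded newlines; collapse all whitespace runs.
            "snippet": " ".join(result.get("snippet", "").split()),
            "date": result.get("date", ""),
        })
    return rows


# First entry is copied from the JSON response above; the second is a
# made-up entry with missing fields, for illustration only.
sample = {
    "news_results": [
        {
            "position": 2,
            "title": "Trump’s Oil Summit",
            "source": "Wall Street Journal",
            "date": "Opinion · 16 hours ago",
            "snippet": "Trump's Oil Summit. Tariffs and quotas won't solve a "
                       "price shock caused by \na pandemic and a Saudi "
                       "Arabia-Russia feud.",
        },
        {"position": 3, "title": "Entry without snippet or date"},
    ]
}

for row in summarize_news(sample):
    print(f"Title: {row['title']}\nSnippet: {row['snippet']}\nDate: {row['date']}\n")
```

In real use you would pass it the dict returned by client.get_dict() instead of the sample.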
If you want more information, check out the SerpApi documentation or the live playground.
Disclosure: I work for SerpApi.
Hi, you can scrape it easily like this:
from bs4 import BeautifulSoup
import requests

url = "https://news.google.co.in/"
code = requests.get(url)
soup = BeautifulSoup(code.text, 'html5lib')
# Print the text of every headline span.
for title in soup.find_all('span', class_="titletext"):
    print(title.text)
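To check the selector logic without hitting the network, here is a small sketch that does the same span/"titletext" extraction on a hard-coded HTML string. Note that it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra installs. The class name comes from the snippet above; the sample HTML and the TitleTextParser name are made up, and the real page markup may well differ.

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collect the text of every <span class="titletext"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # > 0 while inside a matching span (counts nesting)
        self._buffer = []  # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            self._depth += 1          # nested span inside a headline
        elif "titletext" in classes:
            self._depth = 1           # start of a headline span
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:      # headline span closed
                self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Made-up markup, only to exercise the parser.
SAMPLE_HTML = """
<div><span class="titletext">First headline</span></div>
<div><span class="other">Not a headline</span></div>
<div><span class="titletext">Second <span>nested</span> headline</span></div>
"""

parser = TitleTextParser()
parser.feed(SAMPLE_HTML)
print(parser.titles)
```

For real pages BeautifulSoup is still the more convenient choice; this is only a way to test the idea offline.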