How to Scrape Google News SERP the Easy Way in 2023
What will be scraped:
Prerequisites
Install libraries:
pip install requests bs4 google-search-results
google-search-results is the SerpApi API package.
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, which also makes them useful for extracting data from matching tags and attributes.
If you haven’t scraped with CSS selectors, there’s a dedicated blog post of mine about how to use CSS selectors when web scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
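As a quick illustration of the idea, here's a minimal, self-contained sketch of extracting data with CSS selectors via BeautifulSoup (the HTML snippet and class names are made up for this example, not taken from Google's actual markup):

```python
from bs4 import BeautifulSoup  # pip install bs4

# Toy HTML resembling a single news result
html = """
<div class="news">
  <a class="title" href="https://example.com/article">Example headline</a>
  <span class="source">Example Source</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ".news .title" selects the element with class "title"
# nested inside the element with class "news"
title = soup.select_one(".news .title").text
link = soup.select_one(".news .title")["href"]

print(title)  # Example headline
print(link)   # https://example.com/article
```

The same pattern (select_one() for a single match, select() for all matches) is what a DIY Google News parser would use, just with selectors matching Google's markup instead.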
Separate virtual environment
In short, it creates an independent set of installed libraries, including different Python versions, that can coexist with each other in the same system, preventing library or Python version conflicts.
If you haven’t worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get a bit more familiar.
📌Note: this is not a strict requirement for this blog post.
Reduce the chance of being blocked
There’s a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping; there are eleven methods to bypass blocks from most websites.
Make sure to pass a User-Agent header, because Google might block your requests eventually, and you'll receive different HTML and thus empty output.
The User-Agent header identifies the browser, its version number, and its host operating system. It represents the visitor (browser) in a web context and lets servers and network peers identify whether the request comes from a bot. By sending a real browser's User-Agent, we're faking a "real" user visit.
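As a sketch, passing the header with requests might look like this (the User-Agent string below is just an example value; substitute a recent one from your own browser, and note the actual request is left commented out since it hits a live endpoint):

```python
import requests  # pip install requests

# Example desktop Chrome User-Agent string; replace with an up-to-date value
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

# tbm=nws restricts the search to news results
params = {"q": "gta san andreas", "tbm": "nws", "gl": "us"}

# Without the User-Agent header, Google may serve different HTML or block the request:
# html = requests.get("https://www.google.com/search",
#                     params=params, headers=headers, timeout=30).text
```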
Using Google News Result API
The main difference between the API and a DIY solution is that the API is a quicker approach if you don’t want to create a parser from scratch, maintain it over time, or figure out how to scale the number of requests without being blocked.
Basic Hello World example:
from serpapi import GoogleSearch
import json

params = {
    "api_key": "...",        # https://serpapi.com/manage-api-key
    "engine": "google",      # serpapi parsing engine
    "q": "gta san andreas",  # search query
    "gl": "us",              # country from where search comes from
    "tbm": "nws"             # news results
    # other parameters such as language `hl` and number of news results `num`, etc.
}

search = GoogleSearch(params)  # where data extraction happens on the backend
results = search.get_dict()    # JSON -> Python dictionary

for result in results["news_results"]:
    print(json.dumps(result, indent=2))
Outputs:
{
  "position": 1,
  "link": "https://www.sportskeeda.com/gta/5-strange-gta-san-andreas-glitches",
  "title": "5 strange GTA San Andreas glitches",
  "source": "Sportskeeda",
  "date": "9 hours ago",
  "snippet": "GTA San Andreas has a wide assortment of interesting and strange glitches.",
  "thumbnail": "https://serpapi.com/searches/60e71e1f8b7ed2dfbde7629b/images/1394ee64917c752bdbe711e1e56e90b20906b4761045c01a2cefb327f91d40bb.jpeg"
}
Google News Results API with Pagination
If there’s a need to extract all results from all pages, SerpApi has a great Python pagination() method that iterates over all pages under the hood and returns an iterator:
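Building on the Hello World example above, a minimal sketch of pagination() might look like the following (it requires a valid API key, so it makes live requests when run):

```python
from serpapi import GoogleSearch  # pip install google-search-results
import json

params = {
    "api_key": "...",        # https://serpapi.com/manage-api-key
    "engine": "google",      # serpapi parsing engine
    "q": "gta san andreas",  # search query
    "gl": "us",              # country from where search comes from
    "tbm": "nws"             # news results
}

search = GoogleSearch(params)

# pagination() returns an iterator that requests each page of results
# under the hood until there are no more pages left
for page in search.pagination():
    for result in page.get("news_results", []):
        print(json.dumps(result, indent=2))
```

Because each page is fetched lazily as the iterator advances, you don't have to manage the start offset parameter yourself.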
Source of the Article: https://serpapi.com/blog/