A Comprehensive Guide to Web Scraping With Python

Author: Umut Gokbayrak
19 October 2020

Do you need to extract information from a web page?

Perhaps you want to download content from a directory. Or save images from a photo gallery. How can you do that without having to copy content manually, one piece at a time?

This article reveals how to extract webpage content programmatically by web scraping with Python.

Python is one of the most popular programming languages in the world. It has become the lingua franca of web scraping and data science, thanks to libraries such as BeautifulSoup4, pandas, NumPy and requests.

Read on to discover how to web scrape using Python and keep it simple through an API.

What Is Web Scraping?

Web scraping is simply the process of extracting data from a web page programmatically.

Unlike a normal browser session, the user doesn't need to be present. Code written in languages like Python or Java can automate the whole process. You can then parse or save the collected data as required.

Web Scraping Use Cases

Use cases include:

  • Building a contacts list for marketing
  • Competitor analysis
  • Price comparison over time
  • Collecting data for machine learning

Web scraping isn't limited to text content. Binary data such as image files and videos can be collected as well.

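
As a minimal sketch of how binary content could be saved, the snippet below streams a response body to disk using the requests package that is installed later in this guide. The helper names filename_from_url and download_file, the fallback name download.bin and the example URL are assumptions made for illustration; none of them come from a library.

```python
import os
import posixpath
from urllib.parse import urlparse

import requests


def filename_from_url(url, default='download.bin'):
    """Derive a local file name from the last path segment of a URL."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


def download_file(url, dest_dir='.', timeout=10):
    """Stream a binary resource (image, video, ...) into dest_dir."""
    response = requests.get(url, stream=True, timeout=timeout)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    path = os.path.join(dest_dir, filename_from_url(url))
    with open(path, 'wb') as fh:
        # Write in chunks so large files never sit fully in memory.
        for chunk in response.iter_content(chunk_size=8192):
            fh.write(chunk)
    return path


# Hypothetical URL, used only to show how the file name is derived.
print(filename_from_url('https://example.com/gallery/photo.jpg?size=large'))
```

Calling download_file with the address of an image would write it to the current directory. Opening the file in 'wb' mode matters, because image and video data must be written as raw bytes rather than text.
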
Scraping Code

Installing Dependencies

I assume you already have Python 3 installed on your computer. If you need help installing Python 3, check out these tutorials for Linux, Windows, and Mac.

Create a new project directory, switch into it, and create a new virtual environment using the virtualenv command.


$ mkdir scraping-project
$ cd scraping-project
$ virtualenv -p python3 venv

Once created, we activate the virtual Python 3 environment by running:


$ source venv/bin/activate

That's it. Now we can install and remove dependencies as if scraping-project were the only project on the system. We start by installing the mighty requests and BeautifulSoup4 (bs4) packages as follows:


(venv) $ pip3 install requests bs4
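
One quick way to confirm that the installation worked is to import both packages inside the activated environment and print their versions. This is only a sanity check under the assumption that the pip command above finished without errors.

```python
import bs4
import requests

# Both packages expose a __version__ string; printing it confirms the imports work.
print('requests', requests.__version__)
print('beautifulsoup4', bs4.__version__)
```

If either import fails with ModuleNotFoundError, the virtual environment is most likely not activated.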

Now we're ready to start scraping articles from the Prompt API website.

Code Examples

Task 1 - Scraping titles from Prompt API Blog

Fire up your favorite text editor (we prefer VS Code) and create a file called main.py. Type the following code to start your first scraping task.


from bs4 import BeautifulSoup
import requests

We start by importing our two dependencies (bs4 and requests). requests is the library we use to fetch the remote content over HTTP, and BeautifulSoup4 is the library we use to parse the HTML content. Both are very popular in the Python community.

Next, we fetch the HTML content by:


page = requests.get('https://promptapi.com/blog')

Pretty simple, isn't it? Notice that we called the get method of requests. As you may guess, to perform an HTTP POST request you would call requests.post() instead.

With a single line of code, we've got the page object in memory. We can get the text content via page.text and start parsing it. Thanks to BeautifulSoup4, we don't need to deal with the burden of parsing raw HTML ourselves. Simply run the following line of code and the parsing is done.
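
To make the difference between GET and POST concrete, here is a small sketch that builds both kinds of request without sending them, so nothing goes over the network. The endpoint https://example.com/search and the helper fetch_html are assumptions made for illustration; the timeout value of 10 seconds is an arbitrary choice.

```python
import requests

# Hypothetical endpoint, used only for illustration; no request is sent here.
URL = 'https://example.com/search'

# A GET request carries its parameters in the query string.
get_request = requests.Request('GET', URL, params={'q': 'python'}).prepare()
print(get_request.method, get_request.url)

# A POST request carries form data in the request body instead.
post_request = requests.Request('POST', URL, data={'q': 'python'}).prepare()
print(post_request.method, post_request.url, post_request.body)


def fetch_html(url, timeout=10):
    """Fetch a page, failing loudly on timeouts and on 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

In a real scraper, a wrapper like fetch_html is a reasonable default: without a timeout a stalled server can hang the script, and without raise_for_status an error page would be parsed as if it were real content.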


soup = BeautifulSoup(page.text, 'html.parser')

The soup object behaves much like an in-memory DOM tree, so we can get all the articles by calling find_all on it.


articles = soup.find_all('article')
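
The sketch below shows a few variations of find_all and its sibling find. It runs on a made-up HTML snippet instead of the live page, so the tag names, class names and titles are assumptions chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
html = """
<main>
  <article class="post featured"><h2>Alpha</h2></article>
  <article class="post"><h2>Beta</h2></article>
  <article class="post"><h2>Gamma</h2></article>
</main>
"""
soup = BeautifulSoup(html, 'html.parser')

all_posts = soup.find_all('article')                     # every <article> tag
featured = soup.find_all('article', class_='featured')   # filter by CSS class
first_two = soup.find_all('article', limit=2)            # stop after two matches
first_post = soup.find('article')                        # first match only
missing = soup.find('aside')                             # no match gives None

print(len(all_posts), len(featured), len(first_two))
print(first_post.h2.get_text(), missing)
```

Note that find_all always returns a list (possibly empty), while find returns a single tag or None, so the result of find should be checked before it is used.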

So simple. Now that we have all the articles in hand, we can loop over them and print the titles of the blog posts.


for article in articles:
  print(article.find('a').text)

That's all. The whole code for the first scraping task is below.


from bs4 import BeautifulSoup
import requests

page = requests.get('https://promptapi.com/blog')
soup = BeautifulSoup(page.text, 'html.parser')
articles = soup.find_all('article')
for article in articles:
  print(article.find('a').text)
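
The loop above assumes that every article element contains a link. A slightly more defensive variant could look like the sketch below, which skips articles without a link and also resolves each link to an absolute URL. It runs on made-up markup, so the HTML structure, the function name extract_posts and the post URLs are assumptions; the real page may be laid out differently.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://promptapi.com/blog'

# Made-up markup; the real page structure may differ.
SAMPLE_HTML = """
<article><h2><a href="/blog/first-post">First post</a></h2></article>
<article><p>An article without a link</p></article>
<article><h2><a href="/blog/second-post">  Second post </a></h2></article>
"""


def extract_posts(html, base_url=BASE_URL):
    """Return a list of {'title', 'url'} dicts, one per linked article."""
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    for article in soup.find_all('article'):
        link = article.find('a')
        if link is None:
            continue  # skip articles that have no link at all
        posts.append({
            'title': link.get_text(strip=True),              # trim whitespace
            'url': urljoin(base_url, link.get('href', '')),  # make URL absolute
        })
    return posts


posts = extract_posts(SAMPLE_HTML)
for post in posts:
    print(post['title'], post['url'])
```

Without the None check, the article that has no link would raise an AttributeError and stop the whole loop.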

Task 2 - Fetching Meta Tags with CSS Selectors

BeautifulSoup4 is a very powerful library. One of the things we love most about it is the ability to run CSS selectors on soup objects. Take the following example.


from bs4 import BeautifulSoup
import requests

page = requests.get('https://promptapi.com/blog')
soup = BeautifulSoup(page.text, 'html.parser')
metas = soup.select('meta')
print(metas)

The output will be like:

[<meta charset="utf-8"/>, <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>, <meta content="Highly curated API marketplace with a focus on reliability and scalability. Allows software developers building the next big thing much easier and faster." name="description"/>, <meta content="summary" name="twitter:card"/>, <meta content="@promptapi" name="twitter:site"/>, <meta content="@promptapi" name="twitter:creator"/>, <meta content="Prompt API | Hassle-free API marketplace" property="og:title"/>, <meta content="API marketplace and ready to run app backends for your mobile app and website." property="og:description"/>, <meta content="/assets/logo/square_large_bg.png" property="og:image"/>]

If we wish to select a single meta tag with name="viewport", it is as easy as pie:


from bs4 import BeautifulSoup
import requests

page = requests.get('https://promptapi.com/blog')
soup = BeautifulSoup(page.text, 'html.parser')
metas = soup.select('meta[name="viewport"]')
print(metas)
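
Note that select returns a list even when only one tag matches. When a single tag is wanted, select_one is often more convenient, and attribute values can be read like dictionary keys. The sketch below runs on a shortened, made-up head section, so the content values are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, made-up <head> used instead of the live page.
html = """
<head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="summary" name="twitter:card"/>
  <meta content="Example title" property="og:title"/>
</head>
"""
soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first matching tag, or None when nothing matches.
viewport = soup.select_one('meta[name="viewport"]')
print(viewport['content'])

missing = soup.select_one('meta[name="robots"]')
print(missing)

# Collect every meta tag that has a content attribute into a plain dict,
# keyed by its name or property attribute.
meta = {}
for tag in soup.select('meta[content]'):
    key = tag.get('name') or tag.get('property')
    if key:
        meta[key] = tag['content']
print(meta)
```

Turning the tags into a dictionary makes later lookups such as meta['og:title'] straightforward.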

Task 3 - Setting custom HTTP headers

If you've followed the previous two examples, you've got the basic skills to scrape a website using Python. Unfortunately, things are not always that simple. Most of the time, the target site will block your IP address once it detects that you're scraping its content.

To overcome this issue, you need to act like an actual browser. Start by setting the HTTP headers.

The most common HTTP headers are User-Agent, Accept-Language and Accept. Start by setting them and add more headers as needed.


from bs4 import BeautifulSoup
import requests

headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Language': 'en-US,en;q=0.9,tr;q=0.8'
}

page = requests.get('https://promptapi.com/blog', headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
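
When several pages are fetched in a row, the headers can be set once on a requests Session instead of being passed to every call. The sketch below only builds a request and inspects it, so nothing is sent over the network; the shortened header set is an assumption made to keep the example compact.

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept-Language': 'en-US,en;q=0.9',
}

session = requests.Session()
session.headers.update(headers)  # applied to every request made through this session

# Build (but do not send) a request to inspect the headers that would go out.
prepared = session.prepare_request(
    requests.Request('GET', 'https://promptapi.com/blog')
)
print(prepared.headers['User-Agent'])
print(prepared.headers['Accept-Language'])
```

Calling session.get(url) afterwards sends these headers automatically and reuses the underlying connection between requests.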

Some websites use a method called fingerprinting to identify clients. You may need to send a different User-Agent string with each request. To do that, you will need a database of valid User-Agent strings. You can also check User Agent API for a comprehensive database of User-Agents and generate a new one each time you make a request.
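
A minimal sketch of rotating the User-Agent could look like the following. The three strings in USER_AGENTS are sample values picked for illustration, not a maintained database, and the helper name random_headers is an assumption.

```python
import random

# A tiny hand-picked pool for illustration; a real scraper would load a
# larger, up-to-date list from a file or a service.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0',
]


def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en-US,en;q=0.9',
    }


headers = random_headers()
print(headers['User-Agent'])
```

The result can be passed straight to requests.get(url, headers=random_headers()), so that each call picks a fresh value.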

Scraping AngularJS, React and Vue web sites

One of the biggest problems arises when scraping JavaScript-heavy sites, known as SPAs (single-page applications). These are built using technologies such as AngularJS, React and Vue, and their pages are rendered on the client side. If you scrape such a site with the methods above, you'll get a bunch of JavaScript code that does not contain your desired value. You'll need to arm yourself with additional tools that can run the page's JavaScript first, such as Selenium, or Scrapy paired with a JavaScript rendering service. We'll get to that topic in another article.
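
To illustrate the problem, the sketch below parses the kind of HTML a client-side rendered site might return before any JavaScript has run. The markup is made up for illustration (the root element id and the script path are assumptions), but the effect is typical: the parser finds scripts and an empty container, and none of the content.

```python
from bs4 import BeautifulSoup

# Made-up example of a client-side rendered page before JavaScript runs.
spa_shell = """
<html>
  <head><title>My App</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/bundle.js"></script>
  </body>
</html>
"""
soup = BeautifulSoup(spa_shell, 'html.parser')

articles = soup.find_all('article')   # the content we hoped to scrape
root = soup.find(id='root')           # the container JavaScript would fill
scripts = soup.find_all('script')

print(len(articles))                    # no articles in the raw HTML
print(repr(root.get_text(strip=True)))  # the container is empty
print(len(scripts))
```

An empty result like this, combined with script tags and an empty container element, is a good hint that the page needs a tool that executes JavaScript.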

Easy Scraping with APIs

Unfortunately, most of the time your IP address will be flagged within just a few minutes and you will be blocked from accessing the remote site. Changing the User-Agent string is not enough. To overcome this situation, you'll need to change your IP address frequently, as well as your browser headers. Thanks to Scraper API, this is done automatically. It lets you scrape virtually any website while bypassing these limitations. Features include:

  • Changing the IP address to any country's IP range you desire
  • Customizing header data
  • Changing the user-agent automatically
  • Returning text and image data
  • Fetching partial content via a CSS selector

The Scraper API parses everything for you and returns the data you need. Easy!

But if you need to scrape Google for any reason, you'll need more advanced tools such as Google Search Results API. It fetches search results, as well as adverts, in JSON format. Check it out for more details.
