Web Scraping with Python: A Working Web Scraper in 5 Minutes

There are billions of web pages on the Internet and a correspondingly huge amount of information. Web scraping is the automated reading of websites with the goal of collecting a lot of information in a short amount of time. This article shows how to implement simple web scraping projects with Python.

Brief Introduction: Source Code of a Website

To understand and apply web scraping, we first need to look at the general structure and functioning of a website. This article covers only the most basic elements, without which the rest of the text would be incomprehensible. For more depth, w3schools offers detailed tutorials on HTML, CSS, and JavaScript.

The basic structure of any website is implemented with the help of Hypertext Markup Language (HTML). This is used, for example, to define which text sections are headings, to insert images or to define different page sections. In addition, Cascading Style Sheets (CSS) are used to design the website, i.e. to define the font and font color or to specify the spacing between text elements.
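
To make this concrete, the following minimal sketch feeds a tiny, made-up HTML page to the `html.parser` module from Python's standard library and prints the tags it contains together with their attributes. The page content and the class name `TagCollector` are invented for illustration only.

```python
from html.parser import HTMLParser

# A tiny, made-up page: one heading, one styled paragraph, one image
SAMPLE_HTML = """
<html>
  <body>
    <h1>My Notebook Store</h1>
    <p style="color: grey;">Welcome to our shop.</p>
    <img src="notebook.png" alt="A notebook">
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Collects every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("src", "notebook.png")]
        self.tags.append((tag, dict(attrs)))


parser = TagCollector()
parser.feed(SAMPLE_HTML)
for tag, attributes in parser.tags:
    print(tag, attributes)
```

The heading, the paragraph with its inline CSS, and the image each show up as a separate element. This element structure is what a scraper navigates later on.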

With HTML and CSS alone, a large number of web pages can already be built. However, many pages also use JavaScript to breathe life into the content. As a rule of thumb, everything you see right after you click the refresh button or come to a new page (with a different URL) is defined with HTML and CSS. Everything that “pops up” afterward or is opened with a button without(!) loading a completely new page is programmed with JavaScript.

If you are interested in how specific websites are built, you can view the source code of the open page in most browsers with the key combination Ctrl + Shift + I (on a Mac, Cmd + Option + I). Alternatively, you can right-click on the page and then click Inspect.

Screenshot: the developer console in Google Chrome showing the source code of www.databasecamp.de

Python Libraries for Web Scraping

Python provides various libraries that can be used for web scraping. They differ mainly in how “deeply” they can scrape information from the page, because the desired data can be more or less difficult to reach.

The easiest case is when the information you want to grab is already contained in the code that is delivered when the page is initially loaded. The Python library Beautiful Soup, which we also use in this example, is well suited for this. Scrapy can be used for such applications too. It goes beyond Beautiful Soup in that it also helps with subsequent data processing and data storage.

The Selenium library is especially useful when you first have to interact with the website in order to get the desired information at all. For example, if you have to log in first, you can open the website with Selenium and interact with it via Python code. Beautiful Soup does not offer this feature, as it only parses the HTML it is given and therefore sees just the static elements of the website.
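
The limits of a static parser can be illustrated without any network access. In the made-up page below, the price is only written into the page by JavaScript, so a parser that reads just the delivered HTML finds an empty element. The sketch uses the standard library's `html.parser` as a stand-in for any static parser. The page content and the names `STATIC_PAGE` and `TextById` are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Made-up page: the price is filled in by JavaScript after loading
STATIC_PAGE = """
<html><body>
  <h1 id="title">Notebook X</h1>
  <span id="price"></span>
  <script>
    document.getElementById("price").textContent = "999 EUR";
  </script>
</body></html>
"""


class TextById(HTMLParser):
    """Collects the text directly inside elements that have an id.

    Kept deliberately simple: nested elements are not handled.
    """

    def __init__(self):
        super().__init__()
        self.text = {}           # id -> text found in the static HTML
        self._current_id = None  # id of the element we are currently inside

    def handle_starttag(self, tag, attrs):
        self._current_id = dict(attrs).get("id")
        if self._current_id is not None:
            self.text[self._current_id] = ""

    def handle_endtag(self, tag):
        self._current_id = None

    def handle_data(self, data):
        if self._current_id is not None:
            self.text[self._current_id] += data.strip()


parser = TextById()
parser.feed(STATIC_PAGE)
print(parser.text)  # {'title': 'Notebook X', 'price': ''}
```

The script is never executed by the parser, so the price stays empty. A browser-driving tool such as Selenium would run the script first and could then read the filled-in value.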

What do website operators think about scrapers?

For website operators, web scrapers and other automated visits are comparatively easy to recognize with a web analytics tool. Simple web scrapers stand out, for example, through very short dwell times of less than one second or through many page views by one visitor in a short time. Both suggest that the visitor is not human. The IP addresses of these visitors can then be blocked with just a few clicks. However, automated website visitors do not necessarily have to be a disadvantage for the company or website operator concerned.
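
The two signals described above can be sketched in a few lines. This is only a rough illustration of the idea, not how any particular analytics tool works. The function name, the input format, and the threshold of 30 page views per minute are assumptions; only the dwell time of less than one second comes from the text.

```python
from collections import defaultdict


def flag_automated_visitors(page_views, min_dwell_seconds=1.0, max_views_per_minute=30):
    """Return the visitors whose access pattern looks automated.

    page_views is an iterable of (visitor, timestamp_in_seconds) pairs.
    """
    times_by_visitor = defaultdict(list)
    for visitor, timestamp in page_views:
        times_by_visitor[visitor].append(timestamp)

    flagged = set()
    for visitor, times in times_by_visitor.items():
        times.sort()
        # Dwell time: gap between two consecutive page views of one visitor
        dwell_times = [later - earlier for earlier, later in zip(times, times[1:])]
        if dwell_times and sum(dwell_times) / len(dwell_times) < min_dwell_seconds:
            flagged.add(visitor)
            continue
        # Sliding window: number of page views within any 60 seconds
        start = 0
        for end, current in enumerate(times):
            while current - times[start] > 60:
                start += 1
            if end - start + 1 > max_views_per_minute:
                flagged.add(visitor)
                break
    return flagged


page_views = [
    ("visitor-a", 0), ("visitor-a", 45), ("visitor-a", 130),     # human-like pauses
    ("visitor-b", 0.0), ("visitor-b", 0.2), ("visitor-b", 0.4),  # very short dwell times
]
print(flag_automated_visitors(page_views))  # {'visitor-b'}
```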

The search engine Google, for example, crawls relevant websites at regular intervals and analyzes them for keywords and backlinks in order to update and improve its search ranking. Another example is price comparison sites, which regularly extract prices for relevant products with the help of web scrapers. A little-known online store for notebooks benefits from the wide reach of the comparison site and hopes to increase its sales as a result. Such companies will gladly grant web scrapers access to their site. On the other hand, competitors can use the same method to query prices on a large scale.

It’s hard to say whether web scraping on today’s web is purely positive or purely negative. In any case, it takes place more often than we think. In general, it is important for website operators that scrapers do not endanger the technical functioning of the site and that automated visitors are excluded from their analytics so as not to draw false conclusions. Beyond that, whether web scraping is harmful or useful for a site has to be decided case by case.

Example: Homepage Apple Inc.

For this example, we will scrape the Apple Inc. store to find out what product categories Apple currently offers. This simple example could be used to track categories over time and get an automated notification whenever Apple adds a new product category to its lineup. Admittedly, probably few people will really need such an evaluation, since you can find out after the Apple keynote on nearly any technology news portal, but it is an illustrative example.

The example shown was developed and tested in Python 3.7. First, we import two libraries: urllib, which we use for handling URLs and opening websites, and Beautiful Soup for the actual web scraping. With the help of Beautiful Soup, we store the complete code of the website in the variable “soup”.

import urllib.request

from bs4 import BeautifulSoup

# Download the page and parse its HTML with Python's built-in parser
url = "https://www.apple.com/store"
open_page = urllib.request.urlopen(url)
soup = BeautifulSoup(open_page, "html.parser")
print(soup)
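
In practice, a page may be temporarily unreachable, and some servers may reject requests that do not identify themselves. A slightly more defensive variant of the download step could look like the following sketch. The function name `fetch_html`, the User-Agent string, and the timeout value are assumptions and not part of the original example.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded HTML of url, or None if the page cannot be loaded."""
    # Hypothetical User-Agent string that identifies the scraper
    request = urllib.request.Request(url, headers={"User-Agent": "my-scraper/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset announced by the server, fall back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.URLError as error:
        # HTTPError is a subclass of URLError, so both are handled here
        print(f"Could not load {url}: {error}")
        return None
```

The returned string can be passed to `BeautifulSoup(html, "html.parser")` in the same way as the opened page above, after checking that it is not `None`.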

An examination of the Apple Inc. site reveals that the names of the product categories are stored in sections with the class “rf-productnav-card-title”. These classes are assigned by the developers of the website and have to be searched for anew for each project. So before we can start building a web scraper, we need to identify the positions of the interesting information on the website. In our case, this is the described class.

Screenshot: Apple’s product categories and the corresponding source code that we use for web scraping

In the source code of the website that we have stored in the variable “soup”, we now need to find all elements of the class “rf-productnav-card-title”. We do this with the method “find_all” (the older spelling “findAll” is a deprecated alias). We can loop through these elements with a for loop and output only the text of each element.

for category in soup.find_all(class_="rf-productnav-card-title"):
    # Keep only the visible text of the element, without surrounding whitespace
    print(category.text.strip())
# Output:
Mac
iPhone
iPad
Apple Watch
Apple TV
AirPods
HomePod Mini
iPod touch
Accessories
Apple Gift Card
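
The tracking idea mentioned at the start of this example, getting notified when a new category appears, can be sketched as a comparison between two runs. The function name `find_new_categories` and the JSON state file are hypothetical, and the actual notification (for example an e-mail) is left out.

```python
import json
from pathlib import Path


def find_new_categories(current, state_file):
    """Return categories that were not present in the previous run.

    The current categories are saved to state_file for the next comparison.
    On the very first run, every category counts as new.
    """
    path = Path(state_file)
    if path.exists():
        previous = set(json.loads(path.read_text(encoding="utf-8")))
    else:
        previous = set()
    new_categories = sorted(set(current) - previous)
    path.write_text(json.dumps(sorted(set(current))), encoding="utf-8")
    return new_categories
```

Calling this function with the list of scraped category names after each run returns only the names that have been added since the previous run.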

With these few lines of code, we managed to scrape information from the Apple website. Not every use case will be as quick to implement as this one, because the information we are looking for can be nested much more deeply in the website. Furthermore, we have to keep checking web scraping projects for functionality, as the scraped pages may change their structure, after which we have to adapt our code. So there is no guarantee that our scraper will still work in a few months; it must be checked and revised at regular intervals.
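
One simple way to notice such a structural change early is to fail loudly when the scraper suddenly finds nothing. The following is a minimal sketch under that assumption. The function name and the default threshold are made up.

```python
def require_categories(categories, minimum_expected=1):
    """Raise an error if the scraper found fewer entries than expected."""
    if len(categories) < minimum_expected:
        raise RuntimeError(
            f"Only {len(categories)} categories found, expected at least "
            f"{minimum_expected}. The page structure may have changed."
        )
    return categories
```

If the class name on the page is renamed, the list of scraped categories becomes empty and the check raises an error instead of silently producing no output.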

What you should take away

  • Web scraping is the automated reading of web pages to extract the desired information.
  • Python offers various libraries for scraping, which are to be selected depending on the use case.
  • Some understanding of the structure of websites is required to be able to program a working scraper.

Other Articles on the Topic of Web Scraping

  • If you really want to know everything about web scraping with Python, I highly recommend the book “Web Scraping with Python”, published by O’Reilly.