Rating: 4.66 out of 5 (155 reviews on Udemy)

The Ultimate Web Scraping With Python Bootcamp 2023

Learn to extract data from the web with python in just one course, covering selectolax, playwright, scrapy and more
Instructor: Andy Bek
1,688 students enrolled
English [Auto]
Understand the fundamentals of web scraping in python from absolute scratch
Scrape information from static and dynamic websites and extract it to a variety of formats
Intercept and emulate hidden APIs to identify highly productive alternatives to getting your data
Master the requests library for working with HTTP
Parse and extract content from HTML using beautifulsoup, selectolax, and Microsoft Playwright
Master complex CSS selectors including descendant, child, sibling combinators
Understand how the web works, including HTTP, HTML, CSS, and JavaScript
Create scrapy crawlers and practice items, itemloaders and custom pipelines
Integrate scrapy with playwright for highly performant, fine-tuned dynamic website crawling
Practice processing and extracting data to a variety of formats including csv, json, xml, and SQL

Welcome to the Ultimate Web Scraping With Python Bootcamp, the only course you need to go from a complete beginner in python to a very competent web scraper.

Web scraping is the process of programmatically extracting data from the web. Scraping agents visit a web resource, extract content from it, and then process the resulting data to isolate the specific information of interest.
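
As a rough illustration of that visit, extract, process loop, here is a minimal sketch using only the python standard library. The HTML string, the LinkCollector class, and the extract_links helper are hypothetical stand-ins made up for this example; a real agent would download the page over HTTP rather than read it from a constant.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real agent would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <a href="/books/1">Dune</a>
  <a href="/books/2">Neuromancer</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only text that directly follows an opening <a> tag is recorded.
        if self._href is not None:
            self.links.append((self._href, data.strip()))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def extract_links(html):
    """Extract step: turn raw HTML into a list of (href, text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Process step: keep only the piece of information we care about.
links = extract_links(SAMPLE_HTML)
titles = [text for _, text in links]
print(titles)
```

The parsing libraries covered in the course replace the hand-written parser class with much shorter selector-based queries; the overall shape of the agent stays the same.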

Scraping is the kind of programming skill that offers immediate feedback, and can be used to automate a wide variety of data collection and processing tasks.

Over the next 17+ hours, we will methodically cover everything you need to know to write web scraping agents in python.

This bootcamp is organized in three parts of increasing difficulty designed to help you progressively build your skill.

Part I – Begin

We’ll start by understanding how the web works by taking a closer look at HTTP, the key application layer communication protocol of the modern web. Next, we’ll explore HTML, CSS, and JavaScript from first principles to get a deeper understanding of how websites are built. Finally, we’ll learn how to use python to send HTTP requests and parse the resulting HTML, CSS, and JavaScript to extract the data we need. Our goal in the first part of the course is to build a solid foundation in both web scraping and python, and put those skills into practice by building functional web scrapers from scratch. Selected topics include:

  • a detailed overview of the request-response cycle

  • understanding user-agents, HTTP verbs, headers and statuses

  • understanding why custom headers can often be used to bypass paywalls

  • mastering the requests library to work with HTTP in python

  • what stateless means and how cookies work

  • exploring the role of proxies in modern web architectures

  • mastering beautifulsoup for parsing and data extraction
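
To make the ideas of verbs, headers, user-agents, and query parameters concrete, here is a small sketch that builds (but does not send) a GET request with the standard library's urllib. The URL, the user-agent string, and the build_request helper are assumptions chosen for illustration; the requests library exposes the same ideas through the headers and params arguments of requests.get.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, params, user_agent):
    """Build a GET request with query parameters and custom headers."""
    query = urlencode(params)  # e.g. {"q": "a b"} becomes "q=a+b"
    url = f"{base_url}?{query}"
    headers = {
        "User-Agent": user_agent,  # identifies the client to the server
        "Accept": "text/html",     # the content type we would like back
    }
    return Request(url, headers=headers, method="GET")


# Hypothetical target and user-agent, for illustration only.
req = build_request(
    "https://example.com/search",
    {"q": "web scraping", "page": 2},
    "Mozilla/5.0 (compatible; demo-agent)",
)

print(req.get_method(), req.full_url)
print(req.header_items())
```

Nothing leaves the machine here; passing req to urllib.request.urlopen would actually send it.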

Part II – Refine

In the second part of the course, we’ll build on the foundation we’ve already laid to explore more advanced topics in web scraping. We’ll learn how to scrape dynamic websites that use JavaScript to render their content, by setting up Microsoft Playwright as a headless browser to automate this process. We’ll also learn how to identify and emulate API calls to scrape data from websites that don’t have formally public APIs. Our projects in this section will include an image scraper that can download a set number of high-resolution images given some keyword, as well as another scraping agent that extracts price and content of discounted video games from a dynamically rendered website. Topics include:

  • identifying and using hidden APIs and understanding the benefits they offer

  • emulating headers, cookies, and body content with ease

  • automatically generating python code from intercepted API requests using postman and httpie

  • working with the highly performant selectolax parsing library

  • mastering CSS selectors

  • introducing Microsoft Playwright for headless browsing and dynamic rendering
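
One benefit of hidden APIs is that they usually return structured JSON, so there is no HTML to parse at all. The sketch below assumes a made-up payload shaped like what a store-locator endpoint might return; the field names, the Store dataclass, and the parse_stores helper are all hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical response body, as it might be captured from the network tab.
RAW = """
{"stores": [
  {"name": "Main St", "city": "Austin", "coords": {"lat": 30.27, "lng": -97.74}},
  {"name": "Harbor", "city": "Boston", "coords": {"lat": 42.36, "lng": -71.06}}
]}
"""


@dataclass
class Store:
    name: str
    city: str
    lat: float
    lng: float


def parse_stores(raw):
    """Flatten the nested JSON payload into a list of Store records."""
    payload = json.loads(raw)
    return [
        Store(s["name"], s["city"], s["coords"]["lat"], s["coords"]["lng"])
        for s in payload["stores"]
    ]


stores = parse_stores(RAW)
for store in stores:
    print(store)
```

Compared with scraping rendered HTML, this kind of code tends to be shorter and less sensitive to page layout changes, since it depends only on the shape of the JSON.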

Part III – Master

In the final part of the course, we’ll introduce scrapy. This will give us an excellent, time-tested framework for building more complex and robust web scrapers. We’ll learn how to set up scrapy within a virtual environment and how to create spiders and pipelines to extract data from websites in a variety of formats. Having learned how to use scrapy, we’ll then explore how to integrate it with Playwright so that we can tackle the challenge of scraping dynamic websites from right within scrapy. We’ll conclude this section by building a scraping agent that executes custom JavaScript code before returning the resulting HTML to scrapy. Some topics from this section:

  • set up scrapy and explore its command line interface (“the scrapy tool”)

  • dynamically explore response objects using scrapy shell

  • understand and define item schemas and load data using itemloaders and input/output processors

  • integrate Playwright into scrapy to tackle dynamically rendered JavaScript sites

  • write PageMethods to specify highly specific instructions to the headless browser from right within scrapy

  • define custom pipelines for saving into SQL databases and highly customized output formats
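
As a sketch of what a custom pipeline for saving into a SQL database can look like: scrapy item pipelines are plain python classes, and scrapy calls their open_spider, process_item, and close_spider hooks by name. The class name, the games table, and the title/price fields below are assumptions for illustration, and the pipeline is driven by hand here so the example runs without scrapy installed.

```python
import sqlite3


class SQLitePipeline:
    """Scrapy-style item pipeline that writes each item to SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS games (title TEXT, price REAL)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item so that
        # any later pipelines can keep processing it.
        self.conn.execute(
            "INSERT INTO games (title, price) VALUES (?, ?)",
            (item["title"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Manual drive-through; inside a project, scrapy makes these calls itself.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Example Game", "price": 9.99}, spider=None)
rows = pipeline.conn.execute("SELECT title, price FROM games").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

In a real project the class would also need to be registered in the ITEM_PIPELINES setting before scrapy would use it.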

In this bootcamp, I will take you step-by-step through engaging video lectures and teach you everything you need to know to get started with web scraping in python.

By the end of this course, you will have a complete toolset to conceptualize and implement scraping agents for any website you can imagine.

See you inside!

Introduction

1. Prerequisites
2. A Useful Mental Model
3. All Code Resources

The HTTP Protocol

1. What Is HTTP?
2. The Request-Response Cycle
3. Extra: But, This Website Remembers Me
4. User-Agents
5. HTTP Verbs
6. Status Codes
7. Headers
8. Extra: Headers Do Lie
9. Proxies

HTML, CSS, And JavaScript

1. The Ingredients
2. Markup
3. Attributes
4. Presentation
5. Some More Rules
6. Behaviour
7. More JavaScript
8. JavaScript In Web Scraping
9. Comments
10. Embedded

Web Requests In Python

1. Urllib
2. Requests
3. Setting Headers
4. Query Parameters
5. Authentication And Authorization
6. Aside From GET
7. POSTing Data

Parsing And Extraction

1. BeautifulSoup
2. Tags
3. Parents, Children, And Descendants
4. Siblings
5. Extracting Text
6. All Strings
7. Search
8. Challenge
9. Solution
10. Solution Refinement
11. An Extra: pandas
12. Functional Search Patterns
13. Text Search
14. Searching By CSS
15. Just One Tag

Project 1 - Portfolio Valuation With Google Finance

1. Scope Statement
2. An Extra: Some Finance Concepts
3. Parsing Price
4. Non-USD Prices
5. Adding Structure With Dataclasses
6. Position And Portfolio
7. Tabular Display

APIs: The Hidden Gems

1. Befriend The Network Tab
2. Case Study: Coffee Shop Locations
3. The Advantages Of APIs
4. Full Header Emulation
5. An Extra: Postman
6. Code Generation
7. Challenge
8. Solution: Interacting With The API
9. Solution: Processing The Data
10. Solution: Adding Geocode

Selectolax And Advanced CSS Selectors

1. Introduction
2. What Is selectolax?
3. CSS Combinators
4. Sibling Combinators
5. Selector Types

Project 2 - Image Scraper

1. Scope Statement
2. Prospecting
3. NOTE: Quick Correction To CSS Selector
4. Scraping HTML
5. Filtering Relevant URLs
6. Extracting High-Res Image URLs
7. Saving The Images
8. Stepping It Up With Logging
9. Back To The API
10. Filtered Canonical URLs
11. Pagination Prospecting
12. Wrapping Up

Tackling JavaScript With Microsoft Playwright

1. What You See vs. What You Get
2. Rendering JavaScript
3. Playwright Over Selenium
4. Case Study: Show Me The Money

Project 3 - Building A Configurable Scraping Pipeline

1. Scope Statement
2. Initial Setup
3. Fully Loaded Site
4. Selecting Game Containers
5. More Robust Render Thresholds
6. Extracting Title And Thumbnail
7. Game Category Tags

4.7 out of 5
155 Ratings

Detailed Rating

5 stars: 110
4 stars: 32
3 stars: 10
2 stars: 2
1 star: 2