Scraper API Review: How to Manage Proxies, Web Scraping and CAPTCHAs in 2022


We covered web scraping and the techniques used to accomplish it in the first and second parts of this series, using the BeautifulSoup and Selenium Python packages. If you haven't already, go check them out.

In this final installment of the web scraping series, we'll look at the Scrapy library and Scraper API, as well as the reasons for using these tools.

For this tutorial we'll scrape data from The Movie Database website. This is simply an example; if you actually want their data, you can use their API. The source code can be found on GitHub.

What's the deal with Scrapy?


Scrapy is an open-source Python library. It lets you crawl multiple websites at the same time without having to worry about threads, processes, synchronization, or anything else.

It is extremely quick and handles your requests asynchronously. If you wanted something similar in your own crawler, you'd have to code it from scratch or utilize an async library.

What is the purpose of Scraper API?


Please note that some of the links below are affiliate links that will not cost you anything extra. Know that I only recommend items, tools, and learning services that I've used and found to be truly beneficial. Most importantly, I would never recommend purchasing something you can't afford or aren't prepared to adopt.

Scraper API is a startup that specializes in tactics that will help you scrape the web without worrying about your IP address being blacklisted. They use IP rotation to keep you from being detected. With over 20 million IP addresses and unlimited bandwidth, this service is unrivaled.

Additionally, they handle CAPTCHAs for you and enable a headless browser so that you appear to be a real user rather than being flagged as a web scraper. Integration isn't limited to Scrapy, either: within the Python ecosystem, Scraper API also works with requests, BeautifulSoup, and Selenium.

Other popular platforms such as Node.js, Bash, PHP, and Ruby can also be integrated. To use it, simply concatenate your target URL with their API endpoint in an HTTP GET request, then proceed as you would with any web scraper. In this walkthrough, I'll show you exactly how to do it.
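As a taste of the pattern, here is a minimal sketch using Python's requests library (the API key is a placeholder; the endpoint format is the same one we'll use with Scrapy below):

import requests

API_KEY = "your-scraper-api-key"  # placeholder: your key from Scraper API
target_url = "https://www.themoviedb.org/movie?page=1"

# Concatenate the target URL onto the Scraper API endpoint,
# then send an ordinary GET request as you would to any site.
api_url = "http://api.scraperapi.com/?api_key=" + API_KEY + "&url=" + target_url
response = requests.get(api_url)
print(response.status_code)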

You'll get a 10% discount on your first purchase if you sign up through this Scraper API link with the coupon code lewis10. You can always start with their generous free plan and upgrade as needed.

Getting everything ready.


To begin, we first need to install the Scrapy library. Install it with pip:
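pip install scrapy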

After that, go to the Scraper API page to obtain an API key. We'll need it to access their services quickly and easily, so keep it handy.

Launching the project

With those two steps complete, we should be ready to build the web crawler. Create the project by running:

scrapy startproject projectName

This will scaffold our project, which will have a structure like the one shown below.
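A freshly generated Scrapy project typically looks something like this (the exact files can vary slightly between Scrapy versions):

projectName/
    scrapy.cfg            # deploy/configuration file
    projectName/
        __init__.py
        items.py          # item definitions live here
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # our spider (movies.py) goes in here
            __init__.py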

Now comes the fun part: we'll create a file called movies.py in the spiders folder. Most of the code needed to run our web crawler will live here.

This is how our full code will appear.

import scrapy
from ..items import GetmoviesItem
from .config import API_KEY


class moviesCrawl(scrapy.Spider):
    name = "movies"

    url_link = "https://www.themoviedb.org/movie?page=1"
    page_number = 15

    start_urls = ['http://api.scraperapi.com/?api_key=' + API_KEY + '&url=' + url_link + '&render=true']

    def parse(self, response):
        movies = response.css("div.item.poster.card")

        for movie in movies:
            items = GetmoviesItem()
            items["title"] = movie.css('.title.result::text').extract()
            items["rating"] = movie.css(".user_score_chart::attr(data-percent)").extract()
            items["description"] = movie.css(".overview::text").extract()
            items["poster_link"] = movie.css('.poster.lazyload.fade::attr(data-src)').extract()

            yield items

        next_page_url = "https://www.themoviedb.org/movie?page=" + str(self.page_number)
        next_page = 'http://api.scraperapi.com/?api_key=' + API_KEY + '&url=' + next_page_url + '&render=true'

        if self.page_number <= 15:
            self.page_number += 1
            yield response.follow(next_page, callback=self.parse)
It may look intimidating at first, but we'll go over it line by line.

The first three lines import the library and the objects we'll need to build a working web crawler.

import scrapy
from ..items import GetmoviesItem
from .config import API_KEY
Don't worry about the GetmoviesItem import for now; we'll get to it later. I created a separate file to hold any necessary configuration; in this case, that's the API key we obtained from Scraper API.
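A minimal sketch of what that config file might look like (the file name config.py and the variable name API_KEY follow from the import above; the value is a placeholder for your own key):

# config.py
API_KEY = "your-scraper-api-key"  # placeholder: the key from your Scraper API account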

class moviesCrawl(scrapy.Spider):
    name = "movies"

    url_link = "https://www.themoviedb.org/movie?page=1"
    page_number = 15

    start_urls = ['http://api.scraperapi.com/?api_key=' + API_KEY + '&url=' + url_link + '&render=true']
Things start to get interesting at this point. First we create the moviesCrawl class, which inherits from the Spider class provided by the scrapy module imported at the top of the file. This class will serve as the foundation for our web scraper, and it's where we configure the web crawler's behavior.

First, we must assign it a name, which we do in the name variable. Once the scraper is finished, this name is what we'll use to run it.

The url_link variable simply points to the URL we want to scrape. You'll notice it's a paginated site whose URLs take the form

https://www.themoviedb.org/movie?page={{page number}}

The page_number variable uses this pattern to move the scraper through multiple pages of the target site automatically.

Finally, start_urls is a Scrapy keyword: it's the list of URLs the spider will begin crawling from when no specific URLs are supplied. As a result, the pages listed here will be the first to be downloaded.

All we have to do now is concatenate our url_link with the Scraper API endpoint so we can use Scraper API to its full potential:

'http://api.scraperapi.com/?api_key=' + API_KEY + '&url=' + url_link + '&render=true'

The render=true option simply instructs Scraper API to enable JavaScript rendering, which means a headless browser is used. This is a condensed version of what we covered with Selenium.

    def parse(self, response):
        movies = response.css("div.item.poster.card")

        for movie in movies:
            items = GetmoviesItem()
            items["title"] = movie.css('.title.result::text').extract()
            items["rating"] = movie.css(".user_score_chart::attr(data-percent)").extract()
            items["description"] = movie.css(".overview::text").extract()
            items["poster_link"] = movie.css('.poster.lazyload.fade::attr(data-src)').extract()

            yield items

From Scrapy's documentation:

The parse method is responsible for parsing the response and returning scraped data and/or further URLs to investigate.

In simple terms, this is the method where we process the data obtained from the target website we want to scrape. In our previous two walkthroughs, we defined web scraping as:

The process of collecting information from a webpage by exploiting patterns in the underlying code of the page. We can gather unstructured data from the internet via web scraping, process it, and save it in a structured fashion.

We can automate data extraction once we've recognized the patterns in the web page's code. So, let's have a look at those DOM elements.

If you inspect the page, you'll see that each movie item is enclosed in a div with the classes item, poster, and card. With this knowledge, we'll tell the crawler to grab all CSS elements with those classes.

Let's have a look at the GetmoviesItem class we imported at the start of the script before moving on.

import scrapy

class GetmoviesItem(scrapy.Item):
    # define the fields for your item here, like:
    title = scrapy.Field()
    rating = scrapy.Field()
    description = scrapy.Field()
    poster_link = scrapy.Field()

We need to store the data in an organized fashion once we've crawled the site data. The scraped data is collected in these items objects, which are simply containers. They offer a dictionary-like API with a simple syntax for declaring the fields that are available. Check out this page for further information.
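As a quick illustration of that dictionary-like behavior (the title value here is just a made-up example, assuming the GetmoviesItem class above):

item = GetmoviesItem()
item["title"] = ["Some Movie Title"]   # assign to a declared field just like a dict key
print(item["title"])                   # ['Some Movie Title']
print(dict(item))                      # {'title': ['Some Movie Title']}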

What we've defined in the code above will operate as dictionary keys, storing the information we've collected.

Still following? Great, let's keep going.

The items variable is an instance of GetmoviesItem. From here, we can extract each movie's individual attributes using the same field names we defined as our dictionary keys. Take the rating information, for example: it's stored inside an element with the class user_score_chart. That HTML element has a data-percent attribute, which is why we used the ::attr() expression to access the data stored there. Finally, the yield keyword hands back all the data we need.

The final section of the code is as follows:

        next_page_url = "https://www.themoviedb.org/movie?page=" + str(self.page_number)
        next_page = 'http://api.scraperapi.com/?api_key=' + API_KEY + '&url=' + next_page_url + '&render=true'

        if self.page_number <= 15:
            self.page_number += 1
            yield response.follow(next_page, callback=self.parse)
With yield response.follow(next_page, callback=self.parse), we use the pagination URL to loop through as many pages as we like. Because we're connecting through Scraper API's endpoint, we don't have to worry about our IP address being blacklisted, since they manage the proxies for us. However, I'd advise against making too many requests to a target site while web scraping, as it can degrade the platform's experience for other users.

Finally, depending on the file format you want, saving the data is as simple as running one of these commands:

scrapy crawl movies -o filename.csv
scrapy crawl movies -o filename.json
scrapy crawl movies -o filename.xml


Feature evaluation: proxies


I tested this feature using httpbin, and the IP rotation worked flawlessly across multiple requests. Keep in mind that IP rotation adds to the request time, so your web scraper will run slower than usual.
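That test can be reproduced with a few lines of Python (a sketch using the requests library; httpbin.org/ip simply echoes back the IP address it sees, and the API key is a placeholder):

import requests

API_KEY = "your-scraper-api-key"  # placeholder

# httpbin.org/ip returns the IP address the request arrived from,
# so repeating the call shows whether the proxy IP rotates between requests.
for _ in range(3):
    response = requests.get(
        "http://api.scraperapi.com/",
        params={"api_key": API_KEY, "url": "http://httpbin.org/ip"},
    )
    print(response.text)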

CAPTCHA


To test this feature, go to a website that uses captcha and run the script. Truepeoplesearch is a good place to start because it displays a captcha form right away. You'll discover that the scraper API can easily take care of this for you, allowing you to scrape as usual.

Headless browser


Run the script against a JavaScript-heavy site without the render=true parameter and observe the difference. The JavaScript-powered version of quotes.toscrape.com is a great place to start.
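A quick way to see the difference (a sketch with the requests library; the quotes on quotes.toscrape.com/js are injected by JavaScript, so they should only show up in the response when render=true is set, and the API key is a placeholder):

import requests

API_KEY = "your-scraper-api-key"  # placeholder
target = "http://quotes.toscrape.com/js/"

# Without JavaScript rendering, the quote markup is absent from the raw HTML.
plain = requests.get("http://api.scraperapi.com/",
                     params={"api_key": API_KEY, "url": target})

# With render=true, Scraper API runs a headless browser first, so the quotes appear.
rendered = requests.get("http://api.scraperapi.com/",
                        params={"api_key": API_KEY, "url": target, "render": "true"})

print("quote divs without render:", plain.text.count('class="quote"'))
print("quote divs with render:   ", rendered.text.count('class="quote"'))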

Conclusion


Hopefully, you can now build a basic web crawler with Scrapy and make use of Scraper API along the way.

For more information, see their documentation page to discover the wonderful capabilities they offer to help you avoid some of the difficulties that come with web scraping.
