Python Web Scraping

What is Web Scraping?

Suppose you want some information from the internet, for instance a story written by Dale Carnegie. What will you do? You will open a search engine, find the story, and copy and paste the text from Wikipedia or another site into your document. That works for a single story, but what if you need a huge amount of information? Visiting each website and copying and pasting the same way would be tedious, and it does not scale. That is why you need Web Scraping: it pulls the information for you without any manual work.

Web scraping is the extraction of large amounts of information from websites. Most of that data is unstructured HTML, which can be converted into a structured form such as a Microsoft Excel spreadsheet or a database. You can do Web Scraping with online services, APIs, or dedicated software, or by writing your own code in any programming language to automate the extraction.

Web Scraping is very useful when you need to pull a huge amount of information from websites. It has two parts. The first is the Crawler, an algorithm that browses the internet for what you are looking for by following links across a number of websites. The second is the Scraper, a tool built specifically to extract the data from a website. A Scraper’s design varies with its purpose and the complexity of the task.

In this article, we discuss how you can do Web Scraping with Python. We’ll look at some scenarios where web scraping is used and at why Python is well suited to all kinds of web scraping.

Why is Web Scraping Used?

Web Scraping makes it easy to collect a large amount of information from the internet and removes the effort of doing it manually.

You might be wondering why anyone would need to pull a large amount of information from the internet. What is the use of it? The following applications of Web Scraping show why it is used.

  1. Sentiment Analysis: Web scraping is very useful for analyzing how people feel about a product. For example, an e-commerce website may want to understand its customers’ sentiments so that it can provide the services they need. For such companies, sentiment analysis is a top priority. They collect data from social media platforms such as Twitter and Facebook to learn which products most consumers are interested in. This tells them what people want and which products are likely to succeed, which helps them stay ahead of the competition.
  2. Email Marketing: Several companies also use Web Scraping to collect email IDs for promoting their products and services. They gather these email IDs from various sites where users have registered, and then use bulk email marketing to expand their business and reach customers directly.
  3. Market Research: Web Scraping is also useful for market research. A large amount of scraped data can be used to analyze new market trends and find opportunities to get ahead of competitors. This helps companies decide where to focus to gain an edge in the market.
  4. Price Comparison: A number of services use Web Scraping to collect the prices of the same product on different platforms and compare them. This comparison makes it easy to find the platform that offers a product at the lowest price.
  5. Job Listings: Job listing websites use Web Scraping to gather new job openings and their requirements from several websites and list them in a single place. This helps people find the specific job they are looking for without searching across different platforms.

A question that often comes to mind is whether pulling information from the internet and saving it in our own files is legal. The best answer is that some websites allow web scraping and some don’t. A well-behaved crawler checks automatically whether a website allows crawling before it proceeds. To check for yourself, look at the website’s “robots.txt” file. Enter the main domain address of the website followed by a slash (“/”) and “robots.txt”, and you will see which parts of the site robots are allowed to crawl. Not every website makes this file available.
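
The robots.txt check described above can also be done in code. Here is a minimal sketch that uses Python's standard urllib.robotparser module. The sample rules, the example.com URLs, and the helper name is_allowed are assumptions made for illustration, so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For a live website, you would instead call set_url() with the address of the site's robots.txt file and then read() before calling can_fetch().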

Web scraping itself is not illegal from a technical point of view. Whether it is legal or illegal depends on how you use the data that you scraped.

There are some practices that you should follow while doing Web Scraping. Think of them as the rules of legal web scraping:

  • Leave a gap of 12-15 seconds between the requests that your crawler makes to a website.
  • Avoid using third-party Web APIs for web scraping. You don’t know how such an API is configured, and it may use the data illegally on your behalf, which can get you into trouble.
  • Do not republish the scraped data on the internet or use it for commercial or marketing purposes. The data belongs to its original owner.
  • Before you scrape a website, read its Terms & Conditions carefully to see whether they restrict scraping.
  • If someone has put restrictions on scraping their data, ask them for permission before going further.
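
The 12-15 second gap between requests can be enforced in code. The following is a rough sketch, not a complete crawler. The function name fetch_politely and its parameters are hypothetical, and the fetch argument stands for whatever function you use to download a page.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 12.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait a random time within the allowed range before
            # every request after the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

The sleep parameter exists only so that the pause can be replaced while testing. In normal use you leave it at its default, time.sleep.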

Why is Python Good For Web Scraping?

There can be a number of reasons to choose Python for Web Scraping. Some of the common reasons are as follows:

  • Easy Syntax: Python’s syntax is simple and reads much like statements in English, which makes it easy to understand. Indentation also separates the blocks of code visually, so the structure is easy to follow.
  • Large Community: Python is a popular programming language, and one reason is its large community. If you get stuck while writing your code, the Python community is one of the most active places to seek help.
  • Easy to code: Python is easy to program. You don’t need semicolons (;) to end statements or curly braces (“{}”) to delimit blocks, which keeps the code simple.
  • Dynamically typed: Python is dynamically typed, so you don’t have to declare the data type of a variable before you use it. You can assign a value of any type to a variable whenever required. This saves coding time and reduces the number of lines you write.
  • Small code can handle large tasks: Web Scraping saves time by automating the manual task of pulling information. You might think that the time saved is spent writing the code instead, but that’s not true. In Python you don’t have to write a lot of code for web scraping, because a small amount of code can handle large tasks.
  • Huge collection of libraries and frameworks: Python has a large collection of libraries that support Web Scraping and the processing of scraped data. Popular Python libraries include Matplotlib, NumPy, and Pandas.

So, these were the reasons for choosing Python for Web Scraping. There can be more, but the ones discussed above are the most common. In conclusion, Python is one of the most suitable languages for Web Scraping.

How Do You Scrape Data From A Website?

In Web Scraping, the Crawler requests the URL that you specified in the Web Scraper. The response from the URL tells the crawler whether you are allowed to read the HTML or XML pages. Based on the response, the Web Scraper takes the next step. If you are allowed to read the pages, it passes control to the Scraper, which pulls the information from the URL and saves it in your files. If you are not allowed, it does not go on to scrape the data.

The basic steps of Web Scraping with Python include:

  1. Go to the URL that you want to scrape information from.
  2. Inspect the page and find the relevant information.
  3. Identify the specific data that you want to extract.
  4. Write the code for these tasks and run it.
  5. The Scraper then extracts the data and stores it in the required format on your system.
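
To make the extract-and-store steps concrete, here is a small sketch that uses only Python's standard library, so it runs without installing anything. The sample HTML, the class name "course", and the names CourseTitleParser, scrape_titles, and to_csv are all assumptions made for illustration. A real page would be downloaded first and would have its own structure.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h2 class="course">Python Fundamentals</h2>
  <h2 class="course">Data Visualization</h2>
  <p>Unrelated text</p>
</body></html>
"""


class CourseTitleParser(HTMLParser):
    """Collect the text of every <h2 class="course"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "course") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def scrape_titles(html: str) -> list:
    """Extract the course titles from the page (steps 2 and 3)."""
    parser = CourseTitleParser()
    parser.feed(html)
    return parser.titles


def to_csv(titles) -> str:
    """Store the extracted data in CSV format (step 5)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Course Name"])
    writer.writerows([title] for title in titles)
    return buffer.getvalue()
```

Calling to_csv(scrape_titles(SAMPLE_HTML)) returns the CSV text, which you could then write to a file.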

Libraries Used for Web Scraping

Here are some Python libraries that are used for Web Scraping:

  • Requests Library: This library is very commonly used for web scraping. It is an HTTP library that sends requests to the server for the data we need, and it supports several types of requests such as GET and POST.
  • Beautiful Soup: You may have heard of this library, as it is one of the most widely used Python libraries for Web Scraping. It parses HTML and XML pages into a parse tree from which you can extract data. It is popular because it is simple and easy to understand, which helps beginners pick it up quickly.
  • Selenium: Python’s Selenium library is mainly used for automating the testing of web applications. Although Selenium is not made specifically for Web Scraping, it offers a number of benefits. Because it drives a real browser, the JavaScript on a web page runs before you read the page, which lets you scrape dynamically generated content.
  • LXML Library: This Python library parses HTML and XML pages very quickly by combining the speed of C libraries with the simplicity of the ElementTree API. It works well for large datasets.
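
As a quick illustration of how Beautiful Soup is typically used, here is a minimal sketch. It assumes the beautifulsoup4 package is installed. The sample HTML, the class name "course", and the function name extract_courses are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a class="course" href="/python">Python Basics</a></li>
  <li><a class="course" href="/pandas">Pandas Intro</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""


def extract_courses(html: str) -> list:
    """Return (title, link) pairs for every <a class="course"> element."""
    # "html.parser" is Python's built-in parser, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", class_="course", href=True)
    ]
```

With the Requests library, the html argument would usually come from a call such as requests.get(url, timeout=10).text instead of a hard-coded string.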

Web Scraping Example

Let us understand the concept of Web Scraping with the help of an example:

To do web scraping, you need Python and the libraries discussed above installed on your system. You also need a web browser such as Chrome or Firefox.

Here are the steps of Web Scraping:

  1. Take the URL that you want to scrape: Before you scrape data from a URL, you need to inspect the data source. In this step, you explore the page to understand its structure and the information it contains.

For instance, suppose the URL you want to scrape is: https://www.mygreatlearning.com/academy/learn-for-free/courses/python-fundamentals-for-beginners

Go to this URL, right-click on the page, and choose the Inspect option.

This opens the browser’s developer tools panel, where you can see the page’s HTML code.

  2. Now, decide what information you want to extract from that URL.
  3. Once you have decided on the information you want to extract, the actual coding starts.

Before writing the code, create a virtual environment and install in it the libraries you need from those discussed in this tutorial.

  4. Now, open an editor or the Python shell, and write the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pndas

These imports bring in the libraries installed in the previous step. Selenium’s webdriver loads the page in this example, so the Requests library discussed earlier is not needed here.

from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service("/usr/lib/chromium-browser/chromedriver"))

In the above code, we configured the browser driver for our code. Next, we declare some lists that will hold the data taken from the URL.

courses = []
prices = []
instructor = []
duration = []
ratings = []

Now we use the libraries we installed to do the web scraping:

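# Load the page in the browser before reading its source.
driver.get("https://www.mygreatlearning.com/academy/learn-for-free/courses/python-fundamentals-for-beginners")
data = driver.page_source
soup = BeautifulSoup(data, "html.parser")
# The class names below are placeholders. Replace them with the ones you find when inspecting the page.
for a in soup.find_all("a", href=True, attrs={"class": "24Dsi7"}):
    nameOfCourse = a.find("div", attrs={"class": "_s234cJ"})
    priceOfCourse = a.find("div", attrs={"class": "_sdgfh23"})
    instructorName = a.find("div", attrs={"class": "_79sgjss"})
    durationOfCourse = a.find("div", attrs={"class": "_29fdks3"})
    # Use a separate name here so the ratings list is not overwritten.
    courseRating = a.find("div", attrs={"class": "_28fd7sdg"})
    courses.append(nameOfCourse.text)
    prices.append(priceOfCourse.text)
    instructor.append(instructorName.text)
    duration.append(durationOfCourse.text)
    ratings.append(courseRating.text)
# Close the browser once the data has been collected.
driver.quit()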
  5. Now run this code to extract the data.
  6. After you extract the data, save it in the required format.

To do this, you can use the following code:

file = pndas.DataFrame({"Course Name": courses, "Prices": prices, "Instructor Name": instructor, "Duration of Course": duration, "Ratings of Course": ratings})
file.to_csv("courses.csv", index=False, encoding="utf-8")

Now, after adding this code to the previous code, run the whole script again.

After running it, you will get a file named ‘courses.csv’ containing the data scraped from the URL you provided.

So, this was all about Web Scraping using Python. In the above example, we saw how to write code for web scraping and automatically get the data we need from URLs.

By GIL