Worried about how to get started with web scraping? This quick, simplified guide will walk you through it.
Working with Requests and BeautifulSoup
Before starting any project or scripting any solution, we first create a virtual environment.
Now, what is a Virtual Environment?
If you are familiar with virtual environments and already know how to build and activate one, you can skip this part and continue with the blog.
The first step always includes making a virtual environment and working inside it. So, let's start by making an environment and activating it. Open the terminal, make an environment named "scrap", and activate it.
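On macOS/Linux with Python 3, the built-in venv module is one way to do this (the environment name "scrap" comes from the text above; on Windows the activate script lives under Scripts\ instead of bin/):

```shell
# Create a virtual environment named "scrap" using the stdlib venv module
python3 -m venv scrap

# Activate it (macOS/Linux; on Windows run scrap\Scripts\activate)
source scrap/bin/activate

# Your prompt should now show (scrap); leave the environment later with:
# deactivate
```

Anything you pip-install while the environment is active stays inside the "scrap" folder and does not touch your system Python.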
Before moving ahead, ask a question to yourself:
What is your favorite food?
Let’s say, it’s a PIZZA. Yes, crunchy, delicious, cheesy, and scrumptious pizza.
Next question, can you cook pizza without knowing its recipe?
Or, can anyone cook it for you without knowing what PIZZA is, what it looks like, and how cheesy, crunchy, and delicious it tastes?
The answer is a BIG NO!
Here, Pizza is our requests library. Before working on it, let us understand it.
Request, in English, means asking for something in order to get it. Similarly, in Python, the requests library is used to request the content of a web page. Thus, the requests library performs the elementary task of fetching web page content (just like buying the bread, veggies, cheese, and all your favorite toppings for a PIZZA).
Every external library in Python needs to be installed before use. Let us install these libraries using pip.
To install the libraries, run the following commands in the terminal:
pip3 install requests
pip3 install beautifulsoup4
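Once the installs finish, a quick sanity check is importing both libraries and printing their versions (both packages expose a `__version__` attribute):

```python
# Verify the installation by importing both libraries
import requests
import bs4

print("requests version:", requests.__version__)
print("beautifulsoup4 version:", bs4.__version__)
```

If either import raises ModuleNotFoundError, the install went into a different environment than the one you activated.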
Below is the code that shows the working of the requests library:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.protonshub.com/')
print(res)
print("------------------")
print("The status code is", res.status_code)
Output:
<Response [200]>
------------------
The status code is 200
Code Explanation:
Code Explanation:
Before using any library, we install it and then import it into our script; the two import lines at the top achieve this. Next, requests.get() fetches the page and returns a response object. Printing res gives output like <Response [200]>. To fetch just the numeric status code, we use res.status_code.
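In practice, requests.get() can fail (no network, a bad URL, a timeout), so it is worth wrapping it with a hedge. This is a sketch, not part of the original script; the helper name fetch_status is my own:

```python
import requests

def fetch_status(url):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        res = requests.get(url, timeout=10)
        return res.status_code
    except requests.RequestException:
        # Covers connection errors, timeouts, invalid URLs, etc.
        return None

# A domain that cannot resolve yields None instead of a crash
print(fetch_status("http://nonexistent.invalid/"))
```

Catching requests.RequestException (the base class of all requests errors) keeps the script alive even when the site is unreachable.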
Now, look at the text shown as the title of the page (the text in the browser tab).
How do we fetch the text written there?
This is where BeautifulSoup comes into play. Requests is used to get the source code of the web page, but parsing that page, i.e., fetching the data residing in a particular tag, is where BeautifulSoup helps us.
The BeautifulSoup library in Python is used to extract text from markup languages like HTML and XML. Let's try to fetch the title of the page.
How do you find the tag our desired text resides in?
Through inspecting. Inspecting an element gives us its tag. We know that the title resides in the <title> tag. Let's head toward scraping it now.
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.protonshub.com/')
print(res)
print("------------------")
print("The status code is", res.status_code)
print("------------------")
soup_data = BeautifulSoup(res.text, 'html.parser')
print(soup_data.title)
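You don't need a live website to experiment with this: BeautifulSoup parses any HTML string you hand it. Here is a small self-contained sketch using a made-up snippet instead of a real page:

```python
from bs4 import BeautifulSoup

# A hard-coded HTML snippet (made up for illustration)
html = "<html><head><title>Demo Page</title></head><body><p>Hi</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
print(soup.title)       # the whole <title> tag, markup included
print(soup.title.text)  # just the text inside it: Demo Page
```

The .title attribute returns the tag itself; chaining .text strips the markup and leaves only the inner text, exactly as in the protonshub example above.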
But have you ever thought about how to get the text that is on the body of the web page?
Suppose we want to extract the highlighted text in the image below.
To attain this, we first inspect it. We find that it is in an h3 tag.
So, let's parse the h3 tags and try to extract the text from them.
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.protonshub.com/')
# print(res)
# print("------------------")
# print("The status code is", res.status_code)
# print("------------------")
soup_data = BeautifulSoup(res.text, 'html.parser')
# print(soup_data.title)
h3_tags = soup_data.find_all('h3')
print(h3_tags)
print("\nh3_tag[1] :", h3_tags[1])
print("\nh3_tag[1] text :", h3_tags[1].text)
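find_all() returns a list of every matching tag, so you can loop over it instead of indexing one element at a time. A hedged, offline sketch with a made-up snippet (the headings here are placeholders, not from the real site):

```python
from bs4 import BeautifulSoup

# A made-up snippet with several h3 tags, mimicking the page structure
html = """
<h3>First Heading</h3>
<h3>Second Heading</h3>
<h3>Third Heading</h3>
"""

soup = BeautifulSoup(html, "html.parser")
h3_tags = soup.find_all("h3")

# Collect just the text of every h3 with a list comprehension
texts = [tag.text for tag in h3_tags]
print(texts)  # ['First Heading', 'Second Heading', 'Third Heading']
```

The same pattern, pointed at the real soup_data above, gives you every h3 on the page at once.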
Now, it's over to you. We are done with the simplest scraping using the requests and BeautifulSoup libraries.
How about the HAPPY CUSTOMERS of PROTONSHUB?
Do write a script that delivers the output:
Eberechi Asonye
Amine Belkhiria
(HINT: The solution lies around the smallest composite number)
Thanks for reading.
Happy scraping!