Photo by Nicolas Picard on Unsplash

As a data scientist or data enthusiast, you are always hungry for lots and lots of DATA. I can imagine the heart-eyes when you see a website full of data and want to grab all of it, try every technique you have learnt, and apply statistics and machine learning. Sometimes it is for fun, sometimes for learning or for a business purpose, but you know that collecting a large amount of data is one of the most time-consuming parts of a data scientist's life.

This is where web scraping comes to the rescue. Web scraping simply means sending a request to the website you want, going through the HTML that comes back, finding the relevant tags and boom... you have an enormous amount of data. Web scraping is not difficult, but choosing which data to select for your analysis is the real art.

I am going to show you a very simple way to scrape a website and collect the data you want very fast, using amazon.com as the example. The same approach can be adapted to other websites and other kinds of data. We will not be using the ASIN, which Amazon uses to identify items in its catalogue; instead, we will go through a general approach which can be used for any website.

We will create a simple Python script and add a magic touch of multithreading/multiprocessing to make it much faster. (Yes, multithreading works in this case: the work is mostly I/O calls, and threads release the GIL while they wait on the network, so the GIL is not a bottleneck.) We will use a queue to collect the responses from the different processes/threads, then load that data into a dataframe and save it to a CSV file.
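To make the I/O argument concrete, here is a minimal sketch in which time.sleep stands in for the network wait of a real request. The names fake_fetch and fetch_all and the 0.2 second delay are made up for illustration; nothing here touches Amazon.

```python
import queue
import threading
import time


def fake_fetch(page_no, q, delay=0.2):
    """Stand-in for an HTTP request: wait, then put a result on the queue."""
    time.sleep(delay)  # simulates waiting on the network (I/O-bound work)
    q.put((page_no, f"payload for page {page_no}"))


def fetch_all(page_numbers):
    """Run one thread per page and collect everything from a shared queue."""
    q = queue.Queue()
    threads = [threading.Thread(target=fake_fetch, args=(n, q)) for n in page_numbers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has put its result
    results = []
    while not q.empty():  # safe here: all producers have finished
        results.append(q.get())
    return sorted(results)


start = time.time()
pages = fetch_all(range(1, 6))
elapsed = time.time() - start
print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

Five simulated requests of 0.2 seconds each should finish in roughly 0.2 seconds rather than 1 second, because the threads do their waiting at the same time.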

Suppose you have to scrape amazon.com and save the products' names, prices and ratings. You will probably want to perform some data analysis on these products, so we save them in a pandas dataframe and a CSV file for further analysis.

How to observe the useful data in HTML?

1. Open the page you want the data from. Run your search and click through to the 2nd, 3rd or any other results page. You will see that the URL looks something like https://www.amazon.com/s?k=laptops&page=2&qid=1567174464&ref=sr_pg_2. You can remove &qid=1567174464&ref=sr_pg_2 and press Enter, and it still works. So this becomes your effective URL: https://www.amazon.com/s?k=laptops&page=2

2. Right-click on the data whose tag you want to see (e.g. the product name) and then click Inspect. This opens the HTML for that page. (Observe in the gif above.)

3. Observe the tag where your element is present. In this case, it's the name of the laptop. When clicked, we get the span element which contains this text:

<span class="a-size-medium a-color-base a-text-normal">

4. Now scroll up to find the main tag which contains all the required data:

<div class="sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28">

Observe that this is the parent tag of all the required elements: when we hover over it, all the required elements are highlighted (as shown in the gif).

Now we just need to find the <span> elements for the price and the rating, which are present inside the same div tag. So, all the required tags are:

parent tag: <div class="sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28">
product name: <span class="a-size-medium a-color-base a-text-normal">
price: <span class="a-offscreen">
rating: <span class="a-icon-alt">
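To see how these tags fit together before touching the live site, here is a small sketch that runs BeautifulSoup on a hand-written HTML fragment. The fragment and its sample values are made up for illustration; only the class names come from the tags listed above. Note that when the class string contains spaces, BeautifulSoup compares it against the exact value of the class attribute, so the order of the class names matters.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one search result; real Amazon markup is far larger.
PARENT_CLASS = ("sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 "
                "sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28")
SAMPLE_HTML = f"""
<div class="{PARENT_CLASS}">
  <span class="a-size-medium a-color-base a-text-normal">Example Laptop 15</span>
  <span class="a-offscreen">$499.99</span>
  <span class="a-icon-alt">4.5 out of 5 stars</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for card in soup.find_all("div", attrs={"class": PARENT_CLASS}):
    # Each lookup is scoped to the parent div, so values from one product stay together.
    name = card.find("span", attrs={"class": "a-size-medium a-color-base a-text-normal"})
    price = card.find("span", attrs={"class": "a-offscreen"})
    rating = card.find("span", attrs={"class": "a-icon-alt"})
    rows.append([name.text, price.text, rating.text])

print(rows)
```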

Now we have all the data we need to scrape, so without further ado, let's jump into the code:

Of the libraries we are going to use, requests and BeautifulSoup are the most important for scraping the web. Requests is used to perform HTTP requests over the internet. BeautifulSoup is the library that deals with the messiest part: the HTML and XML. Here, we will be using its HTML capabilities: it turns the HTML into a structured object whose elements can then be easily accessed.

import requests # required for HTTP requests: pip install requests
from bs4 import BeautifulSoup # required for HTML and XML parsing: pip install beautifulsoup4
import pandas as pd # required for getting the data in dataframes: pip install pandas
import time # to time the requests
from multiprocessing import Process, Queue, Pool, Manager # Manager provides the queue shared across processes
import threading
import sys

The rest of the imports are self-explanatory. Now we need to define some variables. Let's see them one by one:

proxies = { # define the proxies which you want to use
    'http': 'http://195.22.121.13:443',
    'https': 'http://195.22.121.13:443',
}

startTime = time.time()
qcount = 0 # the count in queue used to track the elements in queue
products=[] # List to store name of the product
prices=[] # List to store price of the product
ratings=[] # List to store ratings of the product
no_pages = 9 # no of pages to scrape in the website (provide it via arguments)

The products, prices and ratings lists store each field separately so that they can be put into the dataframe later. qcount counts the elements we take out of the queue. The queue itself is shared across the different processes/threads, so that each process/thread can add its results and then finish.

Now the action begins:

We will go through the code in chunks. Here is the complete get_data function:

def get_data(pageNo,q):
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
        "Accept-Encoding":"gzip, deflate",
        "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "DNT":"1",
        "Connection":"close",
        "Upgrade-Insecure-Requests":"1",
    }
    r = requests.get("https://www.amazon.com/s?k=laptops&page="+str(pageNo), headers=headers)#,proxies=proxies)
    content = r.content
    soup = BeautifulSoup(content)
    for d in soup.findAll('div', attrs={'class':'sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28'}):
        name = d.find('span', attrs={'class':'a-size-medium a-color-base a-text-normal'})
        price = d.find('span', attrs={'class':'a-offscreen'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        all=[]
        if name is not None:
            all.append(name.text)
        else:
            all.append("unknown-product")
        if price is not None:
            all.append(price.text)
        else:
            all.append('$0')
        if rating is not None:
            #print(rating.text)
            all.append(rating.text)
        else:
            all.append('-1')
        q.put(all)
        print("---------------------------------------------------------------")

results = []

The webpage which we are trying to access is: https://www.amazon.com/s?k=laptops&page=1
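The URL handling can be sketched with the standard library alone. The helper names clean_search_url and page_url are hypothetical; the first drops the extra query parameters from a copied URL, and the second builds the URL for a given keyword and page number, assuming the k and page parameters behave as observed earlier.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_search_url(url, keep=("k", "page")):
    """Drop every query parameter except the ones listed in keep."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in keep]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def page_url(keyword, page_no):
    """Build the search URL for one keyword and one page number."""
    return "https://www.amazon.com/s?" + urlencode({"k": keyword, "page": page_no})


print(clean_search_url("https://www.amazon.com/s?k=laptops&page=2&qid=1567174464&ref=sr_pg_2"))
print(page_url("laptops", 1))
```

Using urlencode instead of string concatenation also takes care of keywords that contain spaces or other special characters.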

Let's try to understand the function get_data(pageNo, q). It takes the page number and a queue as input. This queue is the common queue shared between the processes/threads, and each of them puts its results there.

Here, the requests library is used for the HTTP call to the Amazon webpage. It is important to note that the headers are necessary, as they make the request look like it comes from a regular browser rather than from a bot (watch it Amazon, we got you :P).

Now comes the beautiful part of the code, using BeautifulSoup.

There are just 4 steps in this process:

content = r.content
  • Get the content of the response using r.content (you could use r.text instead; content gives you the raw bytes, which also works for binary data such as images or PDFs).
soup = BeautifulSoup(content)
  • Create a BeautifulSoup object out of this content, which makes it much easier to parse.
for d in soup.findAll('div', attrs={'class':'sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28'}):
  • Using a for loop, iterate through the soup object to find all the divs whose class is "sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28".
  • Extract the name, price, rating etc., i.e. the objects of your interest, using the span tags and their class names. Append them to the all list, which is then inserted into the queue.

This completes the function that a single thread/process runs: it carries out the request, soupifies the response, extracts the objects and then inserts them in the queue (shared by all processes/threads using Manager).

You can also use Manager.list/dict instead of Queue in this case.

Main Method (starting threads/processes):

if __name__ == "__main__":
    m = Manager()
    q = m.Queue()
    p = {}
    if sys.argv[1] in ['t', 'p']:
        for i in range(1,no_pages):
            if sys.argv[1] in ['t']:
                print("starting thread: ",i)
                p[i] = threading.Thread(target=get_data, args=(i,q))
                p[i].start()
            elif sys.argv[1] in ['p']:
                print("starting process: ",i)
                p[i] = Process(target=get_data, args=(i,q))
                p[i].start()
        for i in range(1,no_pages): # join all the threads/processes
            p[i].join()
    else:
        pool_tuple = [(x,q) for x in range(1,no_pages)]
        with Pool(processes=8) as pool:
            print("in pool")
            results = pool.starmap(get_data, pool_tuple)
    print(results)
    while q.empty() is not True:
        qcount = qcount+1
        queue_top = q.get()
        products.append(queue_top[0])
        prices.append(queue_top[1])
        ratings.append(queue_top[2])
    print("total time taken: ", str(time.time()-startTime), " qcount: ", qcount)
    df = pd.DataFrame({'Product Name':products, 'Price':prices, 'Ratings':ratings})
    df.to_csv('products.csv', index=False, encoding='utf-8')

In the main method, I present three ways in which you can call the get_data function using concurrency and parallelism:

  • using threads
  • using processes
  • using a pool

if sys.argv[1] in ['t', 'p']:

The above conditional statement lets us run either threads or processes when the user passes t or p as the argument.

for i in range(1,no_pages):

This for loop runs over the page numbers. Since range(1, no_pages) stops before no_pages, the hardcoded no_pages = 9 scrapes pages 1 to 8. You can take it as an argument from the user instead. Each page number gets appended to the URL, and each URL is handled by a new thread/process:

e.g. https://www.amazon.com/s?k=laptops&page=1 runs in one thread, then https://www.amazon.com/s?k=laptops&page=2 in another and so on...
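One way to take these values from the user is argparse from the standard library. This is only a sketch: the option name --pages, its default of 8 and the helper parse_args are assumptions, not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the scraper's options; argv=None means read them from sys.argv."""
    parser = argparse.ArgumentParser(description="Scrape search result pages")
    # 't' = threads, 'p' = processes, anything else = pool (mirrors the script)
    parser.add_argument("mode", nargs="?", default="pool",
                        help="t for threads, p for processes, default: pool")
    parser.add_argument("--pages", type=int, default=8,
                        help="number of result pages to scrape")
    return parser.parse_args(argv)


args = parse_args(["t", "--pages", "3"])
# range(1, pages + 1) so that exactly `pages` pages are visited
page_numbers = list(range(1, args.pages + 1))
print(args.mode, page_numbers)
```

Because the mode argument is optional here, running the script without any argument falls back to the pool instead of failing.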

p[i] = threading.Thread(target=get_data, args=(i,q))
p[i].start()

or

p[i] = Process(target=get_data, args=(i,q))
p[i].start()

The Manager.Queue object is passed to each thread/process so that all of them share a common queue. The above part creates a thread or a process depending on the user input and then starts it.

for i in range(1,no_pages):
    p[i].join()

Once the threads/processes are started, we need to join them so that the rest of the program waits for them to finish; otherwise the main program would continue and start reading the queue while they are still running in the background. The join happens in a separate loop because joining inside the first loop would make each thread finish before the next one starts.
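As an aside, the standard library's concurrent.futures can do the start-and-join bookkeeping for you. The sketch below is an alternative to the article's explicit loops, not the method used in the script; fake_page_job is a made-up stand-in for get_data that sleeps instead of making a request.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_page_job(page_no):
    """Stand-in for get_data: pretend to fetch a page and return a row count."""
    time.sleep(0.1)  # simulates network latency
    return page_no, 10 * page_no  # (page number, made-up number of rows)


# Leaving the `with` block waits for all submitted jobs, like calling join().
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fake_page_job, n) for n in range(1, 5)]

# Every future is finished at this point, so result() returns immediately.
finished = [f.result() for f in futures]
print(finished)
```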

pool_tuple = [(x,q) for x in range(1,no_pages)]
with Pool(processes=8) as pool:
    print("in pool")
    results = pool.starmap(get_data, pool_tuple)

If the argument is neither t nor p, the script falls back to the pool for parallelization. Note that some argument must still be passed, because sys.argv[1] raises an IndexError when the script is run without one. We generate a list of tuples of the format (page_no, queue): [(1,q),(2,q)…] etc.

Q: Why use the starmap method here?
A: Pool.map passes a single argument to the function. Since get_data takes two arguments, we use starmap, which unpacks each tuple into separate arguments.
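The difference between map and starmap is easy to see on a toy function. The sketch below uses ThreadPool, which exposes the same map and starmap methods as Pool but runs in threads, so it works without any pickling concerns; describe and the job list are made up for illustration.

```python
from multiprocessing.pool import ThreadPool


def describe(page_no, label):
    """Two-argument function, like get_data(pageNo, q)."""
    return f"{label}:{page_no}"


jobs = [(1, "laptops"), (2, "laptops"), (3, "laptops")]

with ThreadPool(processes=2) as pool:
    # starmap unpacks each tuple: describe(1, "laptops"), describe(2, "laptops"), ...
    unpacked = pool.starmap(describe, jobs)
    # map passes the whole tuple as ONE argument, so it has to be unpacked by hand
    wrapped = pool.map(lambda job: describe(*job), jobs)

print(unpacked)
print(wrapped)
```

Both calls return the results in the same order as the input list.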

Now reap the fruits of your labour:

Processing is complete, and all the data from all the processes is in the queue. Now we need to get the data out of the queue, which is very simple:

while q.empty() is not True:
    qcount = qcount+1
    queue_top = q.get()
    products.append(queue_top[0])
    prices.append(queue_top[1])
    ratings.append(queue_top[2])

We iterate through the queue until it's empty using q.get(); each element received is a list, whose items are extracted and appended to the products, prices and ratings lists.
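A slightly more defensive way to drain a queue is to call get_nowait and stop on queue.Empty, instead of checking empty() first. The sketch below uses a plain queue.Queue with made-up rows; the helper name drain is hypothetical, and the expectation that a Manager queue behaves the same way is an assumption you should verify in your setup.

```python
import queue


def drain(q):
    """Pull every item out of q and split the rows into three columns."""
    products, prices, ratings = [], [], []
    while True:
        try:
            name, price, rating = q.get_nowait()  # raises queue.Empty when nothing is left
        except queue.Empty:
            break
        products.append(name)
        prices.append(price)
        ratings.append(rating)
    return products, prices, ratings


# Made-up rows in the same [name, price, rating] shape that get_data produces.
q = queue.Queue()
q.put(["Example Laptop 15", "$499.99", "4.5 out of 5 stars"])
q.put(["unknown-product", "$0", "-1"])

products, prices, ratings = drain(q)
print(products, prices, ratings)
```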

df = pd.DataFrame({'Product Name':products, 'Price':prices, 'Ratings':ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')

In this last part of the code, we create the dataframe out of the lists and save it as a CSV file using pandas.

And the result is …

Here you have a beautifully created products.csv with Product Name, Price and Ratings. With this, you have successfully scraped amazon.com.
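The price and rating columns are saved as text, so a typical next step before any analysis is to turn them into numbers. The following is a rough sketch under the assumption that prices look like "$1,299.99" and ratings like "4.5 out of 5 stars"; parse_price and parse_rating are hypothetical helpers, and the '$0' and '-1' placeholders written by get_data map to 0.0 and -1.0.

```python
import re


def parse_price(text):
    """Turn a price string such as '$1,299.99' into a float (0.0 if no number is found)."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else 0.0


def parse_rating(text):
    """Turn '4.5 out of 5 stars' into 4.5; the '-1' placeholder becomes -1.0."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else -1.0


print(parse_price("$1,299.99"), parse_rating("4.5 out of 5 stars"))
print(parse_price("$0"), parse_rating("-1"))
```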

Important points to remember:

Don't use the multiprocessing.Queue object, as it will throw an error when used with Pool objects. Always use a Manager.Queue object as used above. Frankly, everywhere I have researched, I have found this as the only solution, but I still haven't been able to find an answer to why multiprocessing.Queue doesn't work. I have raised the question on StackOverflow, but in vain.
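One likely explanation, offered as a sketch rather than a definitive answer: a Pool sends each task's arguments to its workers by pickling them, and a plain multiprocessing.Queue refuses to be pickled outside of process creation, while the queue returned by a Manager is a small proxy object that pickles fine. The helper can_pickle below is made up to illustrate this.

```python
import multiprocessing
import pickle


def can_pickle(obj):
    """Return True if obj survives pickling, which Pool needs for task arguments."""
    try:
        pickle.dumps(obj)
        return True
    except (RuntimeError, TypeError, pickle.PicklingError):
        return False


plain_queue = multiprocessing.Queue()
# A plain queue is meant to be inherited by child processes, not pickled as a task argument.
plain_ok = can_pickle(plain_queue)
print("multiprocessing.Queue picklable:", plain_ok)

if __name__ == "__main__":
    # Manager() starts a helper process, so keep it under the main guard.
    with multiprocessing.Manager() as manager:
        print("Manager().Queue() picklable:", can_pickle(manager.Queue()))
```

This would also explain why the plain queue works with Process(target=..., args=(q,)): there the queue is handed over while the child process is being created, which is the one situation where it allows itself to be transferred.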

Enhancements to be considered:

  1. Instead of hardcoding "laptops" as the label and the number of pages, you can take them as input from the user (for example as a list of labels) and iterate through all the labels in case you want data for multiple products.
  2. Try various other tags on the website and scrape other product information.
  3. With different labels such as laptops, televisions and mobile phones, the data will come out of the queue in random order, because with multithreading any thread/process may insert into the queue at any time. Here the dataframe comes to the rescue: as long as each row carries its label, it is very fast and easy to split the dataframe into separate dataframes (and hence CSVs) based on the label name.
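The third enhancement could look roughly like this, assuming each scraped row is extended with the label it was searched under. The rows and the Label column are made up for illustration.

```python
import pandas as pd

# Made-up rows; the first element is the search label each row came from.
rows = [
    ["laptops", "Example Laptop 15", "$499.99", "4.5 out of 5 stars"],
    ["televisions", "Example TV 42", "$299.00", "4.1 out of 5 stars"],
    ["laptops", "Example Laptop 13", "$899.00", "4.7 out of 5 stars"],
]
df = pd.DataFrame(rows, columns=["Label", "Product Name", "Price", "Ratings"])

# One dataframe per label, regardless of the order the rows arrived in.
per_label = {label: group.reset_index(drop=True) for label, group in df.groupby("Label")}

for label, group in per_label.items():
    print(label, len(group))
    # group.to_csv(f"{label}.csv", index=False, encoding="utf-8")  # one CSV per label
```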

Important Note: This web scraper is just for educational purposes and should be used with caution, as many websites block an IP address when they detect bot traffic coming from it. You can take some precautions in case the data you need is very large and you need to get it very fast:
  • Use different proxies and proxy rotation with the requests library. For proxy rotation, many free proxies can be found on sites such as "freeproxy.com" and "hidemyass.com", and there are thousands of such websites. If you don't want to spend too long searching for proxies, some proxy websites provide reliable proxies at a very nominal cost.
  • This is a very nice article, which I found very useful, that provides methods to save your a** while scraping (pun intended :D).
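
Proxy rotation itself can be sketched in a few lines. The addresses below are placeholders from a documentation-only IP range and will not work as real proxies; proxy_rotation is a hypothetical helper that produces dictionaries in the format expected by the proxies argument of requests.

```python
import itertools

# Placeholder addresses from the documentation range 203.0.113.0/24; replace with real proxies.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(pool):
    """Yield requests-style proxies dicts, cycling through the pool forever."""
    for address in itertools.cycle(pool):
        yield {"http": address, "https": address}


rotation = proxy_rotation(PROXY_POOL)
for page_no in range(1, 5):
    proxies = next(rotation)
    # In the scraper this would become:
    # requests.get(url, headers=headers, proxies=proxies, timeout=10)
    print(page_no, proxies["https"])
```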

The GitHub link to code is here:

tseth92/web-scraper

Please comment with any doubts or discussion points. I will be back with some more articles related to data science, Python and pandas. For further discussions, connect with me on LinkedIn. Until then, Happy Scraping.


Scraping the Web: A fast and simple way to scrape Amazon was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.