
Scraping (web) – extracting data using python and bash [11]

Web Scraping

Today, I am going to show you the whole picture of how to scrape data from websites. Before you begin, you need to assess the nature of the data.

Website: www.brainyquote.com

Nature: Quotes

Intellectual Property/Copyright: BrainyQuote is one of the biggest quote resources on the web. The quotes themselves are the intellectual property of their respective owners; however, I have added proper attribution to their website.

Week Ly (Chrome/Firefox/Opera) extension: The problem with my extension is that I wanted to display a random quote every week. The quotes were inserted manually into a .json file, and the extension would load that file.

Finding and copy-pasting quotes and authors took me a lot of time. The solution for me was to ease that whole process; in my mind, there had to be an easier way of extracting the quotes, which were just HTML text.

1. Identifying the interesting parts

First of all, we need to investigate the target site. Find the SoI (Stuff of Interest).

Each quote on the website is surrounded by an image overlay.

From the result of the inspection, we can see that each quote appears in two places. If you look closely, the alt attribute is the one that could simplify our task.
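If you want to double-check those alt attributes from the terminal before writing any Python, a quick sketch along these lines works (the grep pattern is my approximation, not from the original post):

curl -s https://www.brainyquote.com/topics/age | grep -o 'alt="[^"]*"' | head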

2. Making a plan

Now that we have identified the resources we want, we need a plan to extract the data.

  1. We know that the page itself is HTML.
  2. We need something (a regex, or a parser) that can identify the tags of interest.
  3. Extract the data inside the alt attribute and save it in a plain .txt file.
  4. Eliminate the noise that we might get. (There is always noise! Always…)
  5. Treat the quotes and their authors as distinct objects and separate them.
  6. Replace our existing quotes file with the new quotes.

3. BeautifulSoup

BeautifulSoup is a Python library for pulling data out of HTML and XML files. Requests is a Python library for making HTTP(S) requests to websites and reading the data on the fly.

1. We know that the page itself is HTML. -> BeautifulSoup + Requests

Installing Python PIP

sudo apt-get install python-pip

Installing BeautifulSoup

sudo pip install bs4

Installing Requests

sudo pip install requests
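As a quick sanity check that both libraries are importable (a throwaway one-liner of mine, not part of the original walkthrough):

python -c "import bs4, requests; print('ok')"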

4. Program base

#!/usr/bin/env python

# we start by creating a quoteextract.py file

# we import the libraries
from bs4 import BeautifulSoup as BS
import requests

# BS because we're lazy

5. Using BeautifulSoup to extract quotes

# we define a variable called page which will contain our source page
page = requests.get("https://www.brainyquote.com/topics/age")

# from that source we want only its content, and we use the built-in
# html parser to extract data from the html source
soup = BS(page.content, 'html.parser')

# for every img tag that carries an alt attribute, regardless of its value,
# print the attribute's content
for img in soup.find_all('img', alt=True):
    print(img['alt'])
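The next steps work on a file called output.txt, so the script's output was presumably redirected into it; this is my assumption of how:

python quoteextract.py > output.txt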

6. Checking the output on first try

We got pretty much all the quotes, but you will notice that at the end we also got a BrainyQuote line and a "Please disable your ad blocker" message.

7. Removing the noise

You sed it!

sed '/BrainyQuote/d' output.txt > output2.txt

sed '/Please/d' output2.txt > output3.txt

Now our file is free from noise.
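As a side note, the two deletions can also be done in a single pass; same result, one command:

sed -e '/BrainyQuote/d' -e '/Please/d' output.txt > output3.txt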

8. Analyzing our output

We got quotes, we got authors. We just need to have them dissociated and then re-associated. 😀

Luckily, we have a standard separator between the quote and the author: a dash ( - ).

Before we continue: my original .json file only has room for 15 quotes, and this output has more. Let's eliminate the extra quotes.

We count the number of lines first:

cat output.txt | wc -l

We then store that value as the maximum number of lines:

max=$(cat output.txt | wc -l)

Once we have that value, we sed away the unwanted lines.

sed -e '16,$d' output.txt

As we now have 15 quotes, which matches our source, we can continue.
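Note that sed on its own only prints to stdout; to keep just the first 15 lines, the result has to be redirected into a file. My guess is that this is where the finaloutput.txt used later comes from:

sed -e '16,$d' output.txt > finaloutput.txt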

 

9. Reading in JSON input

jq is THE JSON parser on Unix.

  • Let's not waste time and install it straight away!

sudo apt-get install jq

  • Let's start by reading one line of our weeklyquote.json, just to give you an idea

cat weeklyquote.json | jq '.[] | .q[1]'

Output: (screenshot showing the quote string printed by jq)

  • Going deeper, we attempt to list the quote and its author as jq sees it.

(screenshot: the quote and its author listed with jq)
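The screenshot is missing here, but judging from the .q[1] filter above, each entry seems to hold parallel q (quotes) and a (authors) arrays; that structure is my inference, and under it a filter like this lists a quote with its author:

# print the second quote together with its author (index 1 is arbitrary)
cat weeklyquote.json | jq '.[] | .q[1], .a[1]'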

  • We'll use the same format for our quotes in 'finaloutput.txt'

tr '-' '\n'

  • The translate command will replace each dash with a newline.

Our quotes are now on their own lines, but the leading whitespace annoys me.

  • We now remove the leading and ending white spaces with awk.

awk '{$1=$1}1'

We just got rid of the spaces.

  • Our final touch is to add double quotes around both the quotes and the authors

awk '{ print "\"" $0 "\"" }'

Beautiful: we are done with our replacement file.
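Putting the three formatting steps together as a single pipeline (the input and output filenames are my assumptions):

tr '-' '\n' < finaloutput.txt | awk '{$1=$1}1' | awk '{ print "\"" $0 "\"" }' > newquotes.txt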

10. Replacing with new quotes and authors

  • We start by extracting the old quotes from the json file

(screenshot: the jq command that extracts the old quotes)

Our condition is "less than or equal to 14", as the first element of the array is at index 0.

Our oldquotes file is ready.
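Since the screenshot is gone, here is a minimal sketch of the kind of extraction loop described, assuming the q/a array structure guessed earlier and an oldquotes.txt output file (the names are assumptions):

# dump the first 15 quotes and authors (indices 0 through 14),
# keeping jq's double quotes so they match the quoted lines in newquotes.txt
for i in $(seq 0 14); do
    jq ".[] | .q[$i], .a[$i]" weeklyquote.json
done > oldquotes.txt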

  • Now, we can proceed by matching and replacing the old quotes with the new quotes.

(screenshot: the sed replacement loop)

This beauty reads from the old and the new quotes files; it then compares them and replaces the old quotes and authors in weeklyquote.json with the new ones.
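The actual commands were in the screenshot; a minimal sketch of such a read-and-replace loop, assuming oldquotes.txt and newquotes.txt hold one quoted string per line in matching order (this is my reconstruction, not the original):

# pair up old and new lines, then substitute each old string in the json file
# note: this naive substitution breaks if a quote contains sed
# metacharacters such as / or &
paste oldquotes.txt newquotes.txt | while IFS=$'\t' read -r old new; do
    sed -i "s/$old/$new/" weeklyquote.json
done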

 

  • We are done!!

Our new weeklyquote file can be found here.

 

Conclusion

We have successfully scraped a website, and we read and updated content in a .json file using jq and some bash magic.

Week Ly is a Chrome extension that shows you the current week. When you click on it, it will show you a random quote! I'll make sure to update it weekly now, no excuses 🙂

Download Week Ly on Chrome! Click on the icon.

Also available on Firefox and Opera.

Sources

Ninja featured image

Week Ly chrome extension

BeautifulSoup documentation

jq documentation

 
