
Web Scraper: explanation + full source code [12]

Follow-up

This is a follow-up to yesterday's article. On Sunday, I showed how to scrape data from websites using Python and bash. Today, we are going to merge the parts into a whole.

If you haven’t read my previous article, you can do so by clicking here.

Purpose

Week Ly is a browser extension available for Chrome, Firefox and Opera. It shows which week of the year we are in, and displays a random quote when clicked. Up until yesterday, I was adding the quotes manually to a .json file.

Scraping quotes

The idea is to scrape quotes from the website BrainyQuote.

Yesterday, we managed to do this with a series of bash and Python scripts. Today, we are turning it into a full application.

/!\ Website scraping should be fair: give proper attribution to the website in question /!\

Essential parts for our quote scraper

  1. The source website – BrainyQuote
  2. The quote extractor, which gets new quotes from the source – quoteextract.py
  3. The data (quotes and authors), stripped of excess quotes and noise – excessremoval.sh
  4. The cleaned data with new quotes, formatted by adding double quotes – formatter.sh
  5. The old quotes, taken from the .json file and extracted into a .txt file – dataextract.sh
  6. The new and old quote files, used to supplement our quotes .json file – contentreplace.sh

1. Quote extractor

Original:

[code screenshot: quoteextract.py – original version]

New reusable module:

[code screenshot: quoteextract.py – new reusable module]

Explanation:

  1. Take a topic as an argument.
  2. Use the main URL of BrainyQuote.
  3. Concatenate the topic to the URL.
  4. Check whether the topic/URL exists. If not, we exit.
  5. The topic exists: create a new file, brainy_quotes.txt, or truncate it if it already exists.
  6. Scrape the quotes into our file, brainy_quotes.txt (a sketch of the module follows below).
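
The article shows this module only as a screenshot, so here is a minimal sketch of what quoteextract.py could look like. The use of requests and BeautifulSoup, the exact topic URL path and the .b-qt CSS selector are all assumptions; only the script name, the site and brainy_quotes.txt come from the article.

    #!/usr/bin/env python3
    # quoteextract.py – sketch; requests/BeautifulSoup, the URL path and
    # the CSS selector are assumptions, not the article's actual code.
    import sys

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://www.brainyquote.com/topics/"  # path is a guess

    def scrape(topic):
        # Steps 1-3: take the topic argument and append it to the base URL.
        resp = requests.get(BASE_URL + topic, timeout=10)

        # Step 4: if the topic page doesn't exist, exit.
        if resp.status_code != 200:
            sys.exit("Topic '%s' not found." % topic)

        # Step 5: create brainy_quotes.txt, truncating it if it exists.
        soup = BeautifulSoup(resp.text, "html.parser")
        with open("brainy_quotes.txt", "w") as out:
            # Step 6: write each quote to the file, one per line.
            for quote in soup.select(".b-qt"):
                out.write(quote.get_text(strip=True) + "\n")

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("usage: quoteextract.py <topic>")
        scrape(sys.argv[1])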

2. Excess/Noise Remover

Original:

[code screenshot: excessremoval.sh – original version]

New cleaned module:

[code screenshot: excessremoval.sh – cleaned module]

Explanation:

  1. Since we already know the name of the output file we are getting, this module stays pretty much the same.
  2. We remove the comments.
  3. And we delete the temporary files (a sketch follows below).
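
A minimal sketch of the cleaned excessremoval.sh, assuming GNU sed; the exact patterns and temp-file names are guesses, since the article only shows a screenshot:

    #!/bin/bash
    # excessremoval.sh – sketch; patterns and temp-file names are assumptions.

    # Remove the comments and blank-line noise from the scraped file, in place.
    sed -i '/^#/d;/^$/d' brainy_quotes.txt

    # Delete the temporary files left over from the extraction step.
    rm -f ./*.tmp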

3. Formatter

Original:

[code screenshot: formatter.sh – original version]

New Script:

[code screenshot: formatter.sh – new script]

Explanation:

  1. This script changed significantly, since before it was printing text to the screen and not saving it anywhere.
  2. Forget the 2nd line for a minute.
  3. We start by reading a line.
  4. We replace the dashes with a newline.
  5. We remove whitespace.
  6. We add quotes using awk.
  7. We redirect the output to formatted.txt.
  8. Now back to line 2: we remove formatted.txt if it exists. Why? Simply because we are appending to the file, and if it already existed, more garbage would be appended. We want to start with a clean formatted.txt.
  9. Update: because tr supports only single characters, I had to replace tr with sed. In cases where there are multiple dashes, we want to consider only the last one! (A sketch incorporating this fix follows below.)
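
A minimal sketch of the new formatter.sh, assuming GNU sed (for the \n in the replacement) and an input file named brainy_quotes.txt; the exact expressions are guesses, but the structure follows the steps above, including the greedy match that splits on the last dash only:

    #!/bin/bash
    # formatter.sh – sketch; input name and exact expressions are assumptions.
    rm -f formatted.txt   # line 2: start clean, since we append below

    while read -r line; do
        # The greedy .* makes sed match up to the LAST dash, so dashes
        # inside the quote itself survive – this is why tr had to go.
        printf '%s\n' "$line" \
            | sed 's/\(.*\) - /\1\n/' \
            | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' \
            | awk '{ print "\"" $0 "\"" }' >> formatted.txt
    done < brainy_quotes.txt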

4. JSON data extractor

Original:

[code screenshot: dataextract.sh – original version]

New Script:

[code screenshot: dataextract.sh – new script]

Explanation:

  1. I just removed the leftover sleep 1 comment.
  2. sleep 1 was useful during debugging:
  3. it would slow each pass of the loop down by one second,
  4. which let me see in slow motion what was happening in the loop (see the sketch below).
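
A minimal sketch of dataextract.sh; the quotes.json name and its .quotes key are assumptions, since the article doesn't show the schema:

    #!/bin/bash
    # dataextract.sh – sketch; quotes.json and the .quotes key are assumptions.
    rm -f oldquotes.txt

    # Loop over every quote string in the .json and dump it to a .txt file.
    while read -r quote; do
        # (during debugging, a "sleep 1" here slowed each pass by a second)
        printf '%s\n' "$quote" >> oldquotes.txt
    done < <(jq -r '.quotes[]' quotes.json)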

5. Content Replace

Original:

[code screenshot: contentreplace.sh – original version]

New Script:

[code screenshot: contentreplace.sh – new script]

Explanation:

  1. This is an interesting script featuring a double while loop.
  2. It reads two files in parallel: our old quotes and our new quotes.
  3. Since sed replaces data on the fly in our .json,
  4. we first make a backup of the .json using cp (for copy), as sketched below.
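
The article's code is again a screenshot, so here is one common way to sketch the parallel read it describes: two file descriptors feeding two read calls per iteration. The file names and the GNU-only sed -i flag are assumptions, and a real script would need to escape quotes containing sed's special characters:

    #!/bin/bash
    # contentreplace.sh – sketch; file names and sed -i (GNU) are assumptions.
    cp quotes.json quotes.json.bak   # back up before sed touches the .json

    # Read oldquotes.txt and newquotes.txt in parallel, one line from
    # each per iteration, via two dedicated file descriptors.
    while read -r old <&3 && read -r new <&4; do
        # Swap each old quote for a new one, in place.
        sed -i "s|$old|$new|" quotes.json
    done 3< oldquotes.txt 4< newquotes.txt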

6. Putting it all together

[code screenshot: main script]

Explanation:

  1. We start by prompting the user and taking a single input.
  2. We pass it as an argument to quoteextract.py.
  3. We check whether the number of quotes is less than 15.
  4. If it is, we discard the batch and output the number of quotes found.
  5. If there are 15 or more, we continue.
  6. We sleep 1 at every step so we can see each step.
  7. Normally, without the sleep, it would run fast and display all the text at once.
  8. We want to see the progression.
  9. We launch each bash script one after the other, detailing each step (sketched below).
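
As a rough sketch, the glue script could look like the following. The script and file names come from the article; the prompt text, the quote counting and the step labels are assumptions:

    #!/usr/bin/env python3
    # Main script – sketch of the glue logic; the prompt, labels and
    # quote-counting details are assumptions.
    import subprocess
    import sys
    import time

    topic = input("Topic to scrape: ").strip()

    # Fetch new quotes for the topic.
    subprocess.run([sys.executable, "quoteextract.py", topic], check=True)

    # Count the scraped quotes; discard the batch if fewer than 15.
    with open("brainy_quotes.txt") as f:
        count = sum(1 for line in f if line.strip())
    if count < 15:
        print("Only %d quotes found – discarding." % count)
        sys.exit(1)

    # Run each bash stage in order, sleeping so we can see the progression.
    for step, script in [
        ("Removing noise", "excessremoval.sh"),
        ("Formatting quotes", "formatter.sh"),
        ("Extracting old quotes", "dataextract.sh"),
        ("Replacing content", "contentreplace.sh"),
    ]:
        print(step + "...")
        subprocess.run(["bash", script], check=True)
        time.sleep(1)  # slow things down so each step is visible

    print("Done: quotes .json updated.")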

7. Limitations

On pages with infinite scroll, this scraper will only work partially: it only fetches the page's initial HTML, so content loaded while scrolling is never seen. It is aimed at conventional, mainstream website scraping, even though many websites now use infinite scrolling.

8. Source code

I hope you enjoyed this little tutorial. I will be working on an advanced scraper and will post my results.

The source code can be found on my GitHub repo.

codarren at hackers dot mu
