Web scraper: explanation + full source code
This is a follow-up to yesterday's article. On Sunday, I showed how to scrape data from websites using Python and bash. Today, we are going to merge the parts into a whole.
If you haven’t read my previous article, you can do so by clicking here.
Week Ly is a browser extension available on Chrome, Firefox and Opera. It shows the current week of the year, and displays a random quote when clicked. Until yesterday, I was adding the quotes manually in a .json file.
The idea is to scrape quotes from the website: Brainyquote.
Yesterday we managed to do it with a series of bash and Python scripts. Today, we are making it a full application.
/!\ Website scraping should be fair: give proper attribution to the website in question /!\
Essential parts for our quote scraper
- The source website – brainyquote
- Quote extractor gets new quotes from source – quoteextract.py
- The raw data (quotes, authors) is stripped of excess quotes and noise – excessremoval.sh
- The clean data with new quotes is formatted by adding double quotes – formatter.sh
- We take the old quotes from the .json file and extract them into a .txt file – dataextract.sh
- We use the new quotes and old quotes file, to supplement our quotes .json file – contentreplace.sh
1. Quote extractor
New reusable module:
- Parse a topic as an argument.
- Use the base URL of BrainyQuote.
- Concatenate the topic to the URL.
- Check whether the topic/URL exists. If not, we exit.
- If the topic exists, create a new file brainy_quotes.txt, or truncate it if it already exists.
- Scrape the quotes to our file: brainy_quotes.txt
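The real extractor is quoteextract.py in the repo; as a rough shell sketch of the same flow. The URL pattern and the `b-qt` class are assumptions about BrainyQuote's markup, and the page here is canned so the sketch runs offline:

```shell
#!/bin/sh
# Hedged sketch of the quoteextract step, in shell even though the repo's
# extractor is Python. URL pattern and b-qt class are assumptions.
BASE='https://www.brainyquote.com/topics/'
topic="${1:-nature}"
url="${BASE}${topic}"
echo "would fetch: $url"

# In the real script we would fetch the page and exit if the topic is missing:
#   status=$(curl -s -o page.html -w '%{http_code}' "$url")
#   [ "$status" = "200" ] || { echo "topic not found: $topic"; exit 1; }
# Canned snippet standing in for the downloaded page:
cat <<'HTML' > page.html
<a class="b-qt" href="/quotes/x">Look deep into nature.</a>
<a class="b-qt" href="/quotes/y">The earth laughs in flowers.</a>
HTML

# Truncate (or create) brainy_quotes.txt, then pull out the quote text.
grep -o '<a[^>]*class="b-qt[^>]*>[^<]*' page.html \
    | sed 's/.*>//' > brainy_quotes.txt
cat brainy_quotes.txt
```

The `>` redirection gives us the create-or-truncate behaviour for free.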
2. Excess/Noise Remover
Cleaned-up module:
- Since we already know the name of the output file, this module stays pretty much the same.
- We remove the comments.
- And delete the temporary files.
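What the cleanup looks like depends on the scrape; the sample noise below (blank lines, leftover HTML) is an assumption for illustration, and the real excessremoval.sh is in the repo:

```shell
#!/bin/sh
# Illustrative sketch only: the noise patterns here are assumptions
# about the scraped output, not the repo's exact rules.
printf '%s\n' \
    'Look deep into nature. - Albert Einstein' \
    '' \
    '<div class="ad">noise</div>' \
    'The earth laughs in flowers. - Ralph Waldo Emerson' > scraped.tmp

# Keep only non-empty lines that do not look like leftover HTML.
grep -v '^$' scraped.tmp | grep -v '^<' > brainy_quotes.txt

# Delete the temporary file once the clean copy exists.
rm -f scraped.tmp
cat brainy_quotes.txt
```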
3. Formatter
- This script changed significantly: it was printing text on the screen and not saving it anywhere.
- Forget the 2nd line for a minute.
- We start by reading a line.
- Replace the dashes with newlines.
- Remove whitespace.
- Add double quotes using awk.
- Redirect the output to formatted.txt.
- Now back to line 2: we remove formatted.txt if it exists. Why? Because we append to the file, and anything left over from a previous run would pile up as garbage. We want to start with a clean formatted.txt.
- Update: because tr only supports single-character replacements, I had to replace tr with sed. When a line contains multiple dashes, we only want to split on the last one!
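Assuming each scraped line has the shape `quote text - author` (an assumption about yesterday's output; the real formatter.sh is in the repo), the steps above boil down to something like this sketch. GNU sed is assumed for the `\n` in the replacement:

```shell
#!/bin/sh
# Hedged sketch of formatter.sh; assumed line shape: <quote text> - <author>
printf '%s\n' 'Be yourself - everyone else is taken - Oscar Wilde' > brainy_quotes.txt

# Start with a clean formatted.txt: we append below, so leftovers would pile up.
rm -f formatted.txt

while IFS= read -r line; do
    # Greedy \(.*\) swallows everything up to the LAST " - ", so a quote
    # containing dashes is still split from its author correctly.
    printf '%s\n' "$line" \
        | sed 's/\(.*\) - /\1\n/' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | awk '{ print "\"" $0 "\"" }' >> formatted.txt
done < brainy_quotes.txt

cat formatted.txt
```

The quote and the author each come out on their own line, wrapped in double quotes ready for the .json.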
4. JSON data extractor
- The only change: the commented-out sleep 1 was removed.
- sleep 1 was useful during debugging.
- It would slow down each loop by 1 second.
- This would allow me to see in slow motion what is happening in the loop.
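The real dataextract.sh is in the repo; this sketch assumes a simple quotes.json layout (the exact file structure is an assumption) and shows where the debugging sleep 1 used to sit:

```shell
#!/bin/sh
# Hedged sketch of dataextract.sh; the quotes.json layout is an assumption.
cat <<'JSON' > quotes.json
{
  "quotes": [
    "Stay hungry, stay foolish.",
    "Simplicity is the ultimate sophistication."
  ]
}
JSON

# Pull the quoted strings out of the array, one per line, into oldquotes.txt.
grep -o '"[^"]*"' quotes.json | grep -v '"quotes"' > oldquotes.txt

while IFS= read -r q; do
    # sleep 1   # kept during debugging to watch the loop in slow motion
    printf '%s\n' "$q"
done < oldquotes.txt
```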
5. Content Replace
- This is an interesting script featuring a double while loop.
- It reads two files in parallel: our old quotes and our new quotes.
- Since sed replaces data in place in our .json, we first make a backup using cp (copy).
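The repo's contentreplace.sh pairs the two files with a double while loop; this hedged sketch gets the same line-by-line pairing with two file descriptors, and the sample data is made up for illustration:

```shell
#!/bin/sh
# Hedged sketch of contentreplace.sh with invented sample data.
printf '%s\n' 'old quote one' 'old quote two' > oldquotes.txt
printf '%s\n' 'new quote one' 'new quote two' > newquotes.txt
printf '%s\n' '{ "quotes": [ "old quote one", "old quote two" ] }' > quotes.json

# sed -i (GNU sed) edits quotes.json in place, so keep a backup first.
cp quotes.json quotes.json.bak

# Read one old quote and one new quote per iteration, replacing as we go.
# Note: quotes containing sed metacharacters (/, &, .) would need escaping.
while IFS= read -r old <&3 && IFS= read -r new <&4; do
    sed -i "s/$old/$new/" quotes.json
done 3< oldquotes.txt 4< newquotes.txt

cat quotes.json
```

If anything goes wrong, quotes.json.bak still holds the previous state.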
6. Putting it all together
- We start by prompting the user for a topic.
- We pass it as an argument to quoteextract.py.
- We check whether the number of quotes is less than 15.
- If fewer, we discard the batch and report how many quotes were found.
- If there are 15 or more, we continue.
- We sleep 1 second at every step so each step is visible.
- Without the sleep, it would run fast and display all the text at once.
- We want to see the progression.
- We launch each bash script one after the other, describing each step.
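The quote-count gate at the heart of the driver can be sketched like this; the 15-quote threshold comes from the steps above, while the function name and messages are paraphrased, not the repo's exact wording:

```shell
#!/bin/sh
# Hedged sketch of the driver's quote-count gate (threshold from the article).
min_quotes=15

enough_quotes() {
    # expects one quote per line in the given file
    count=$(wc -l < "$1")
    if [ "$count" -lt "$min_quotes" ]; then
        echo "Only $count quotes found, discarding."
        return 1
    fi
    echo "$count quotes found, continuing."
}

# Demo on a batch that is too small.
printf '%s\n' 'quote one' 'quote two' > brainy_quotes.txt
if enough_quotes brainy_quotes.txt; then
    echo "would now run each bash script in turn, with sleep 1 between steps"
else
    echo "stopping here"
fi
```

When the gate passes, the driver runs excessremoval.sh, formatter.sh, dataextract.sh and contentreplace.sh in order.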
On pages with infinite scroll, this scraper will only work partially, since it only sees the initially loaded content. It is aimed at conventional, paginated websites, even though many websites now use infinite scrolling.
7. Source code
I hope you enjoyed this little tutorial. I will be working on a more advanced scraper and will post my results.
The source code can be found in my GitHub repo.
codarren at hackers dot mu