APIs give you direct but limited access to the data you want from a piece of software. Sometimes, though, there is no API for the data you want, or access to the API is too limited or too expensive. In these scenarios, web scraping lets you get the data as long as it is available on a website. Web scraping means extracting data from a website's pages with the help of web scraping software.
Before you start web scraping, you should know about the Document Object Model (DOM), a cross-platform, language-independent interface that treats an XML or HTML document as a tree structure in which each node is an object representing a part of the document. In other words, the DOM represents a document as a logical tree.
All you need to do is visit any web page, right-click and select Inspect, or press Ctrl + Shift + I, and the browser will show you the HTML that makes up the page.
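To make the tree idea concrete, here is a minimal sketch that prints an indented outline of a tiny HTML snippet using only Python's standard library. The sample markup and the TreePrinter class name are made up for illustration, and the depth tracking assumes every tag is properly closed (void tags such as <br> would need extra handling).

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collects an indented outline of the tags and text in an HTML string."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level in the tree
        self.lines = []  # one outline line per node

    def handle_starttag(self, tag, attrs):
        # An opening tag is a new element node; its children sit one level deeper.
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        self.depth -= 1

    def handle_data(self, data):
        # Text is a node too, just not an element node.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


# Hypothetical sample document, assumed to be well formed.
SAMPLE_HTML = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"

printer = TreePrinter()
printer.feed(SAMPLE_HTML)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the output corresponds to one level of nesting in the tree, which is roughly the view the browser's inspector gives you.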
*Right-click on a web page to get these options
Because the vast majority of code that uses the DOM revolves around manipulating HTML documents, it’s common to refer to the nodes in the DOM as elements, although strictly speaking not every node is an element.
You can search the DOM tree by string, CSS selector, or XPath selector, and you can edit the DOM on the fly to see how those changes affect the page. All the changes you make are temporary, so do not worry about breaking anything; enjoy the learning experience.
Playing around with the DOM helps you locate the data that you want to extract from a web page, and that's where web scraping comes into the picture. There are many Python libraries available for web scraping; the point is to identify which one is most useful and quickest at providing the data you want, in the way you want it.
Python has libraries like Requests-HTML, lxml, and Beautiful Soup, as well as the web scraping framework Scrapy, to get you the data you want. Remember that for a data scientist or a data analyst there is no limit to the greed for more data: every time you pull data you will wish you had more, and some other library will always look better than the previous one. Let me just give my simple understanding of Requests-HTML, Beautiful Soup & Scrapy and leave the rest up to you to decide 🙂
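The same kind of search can be done from code. The sketch below is only an illustration: it uses the limited XPath support in Python's standard xml.etree.ElementTree module on a made-up, well-formed snippet. Real-world HTML is often not valid XML, so this approach is an assumption that holds for the sample only.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML so ElementTree can parse it.
PAGE = """
<html>
  <body>
    <div id="main">
      <p class="intro">Welcome</p>
      <a href="/about">About</a>
      <a href="/contact">Contact</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# XPath-style query: the first <p> anywhere in the tree with class="intro".
intro = root.find(".//p[@class='intro']")

# XPath-style query: every <a> that is a direct child of the div with id="main".
links = [a.get("href") for a in root.findall(".//div[@id='main']/a")]

print(intro.text)
print(links)
```

The query strings here are close to what you would type into the inspector's search box when searching by XPath.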
Requests-HTML
Requests-HTML is one of the simplest Python libraries for web scraping. It lets us make HTTP requests to a website's server and retrieve the HTML content of its pages, so it is a good choice when you simply want to communicate with websites. It is quite useful when you have to collect links, posts, or specific elements picked out with CSS selectors. It is well suited to simple, non-recurring web scraping tasks and is a go-to library for HTTP requests.
Beautiful Soup
Beautiful Soup is a library designed for quick scraping projects. It allows you to select and navigate the tree-like structure of HTML documents, searching for particular tags, attributes, or ids. It also allows you to traverse the document further through relations like children or siblings. In other words, with Beautiful Soup you could first select a specific div tag and then search through all of its nested tags. Beautiful Soup is popular because it is easy to learn and work with, which makes it well suited for beginners. It is again very useful for simple, non-recurring web scraping tasks and is a go-to library for data parsing.
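Here is a small sketch of that "select a div, then search inside it" idea. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend; the sample HTML, ids, and class names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs that both contain <li class="item"> tags.
SAMPLE_HTML = """
<div id="products">
  <h2>Fruit</h2>
  <ul>
    <li class="item">Apple</li>
    <li class="item">Banana</li>
  </ul>
</div>
<div id="footer"><li class="item">Not a product</li></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: select one specific div by its id.
products = soup.find("div", id="products")

# Step 2: search only inside that div, so the footer item is ignored.
items = [li.get_text(strip=True) for li in products.find_all("li", class_="item")]

# Sibling navigation: the <ul> that follows the heading at the same level.
heading = products.find("h2")
sibling_list = heading.find_next_sibling("ul")

print(items)
print(sibling_list.name)
```

Searching from the selected div rather than from the whole document is what keeps the unrelated footer item out of the result.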
Scrapy
Scrapy is not just a library but a complete web scraping framework, with a large ecosystem of developers contributing projects and support on GitHub and Stack Overflow. When it comes to recurring or large-scale web scraping, Scrapy handles the heavy lifting smoothly. It can send multiple HTTP requests at the same time, create pipelines, and extract data from dynamic websites (usually with the help of extra tooling for JavaScript-heavy pages). If you are just beginning to learn, remember that it has a steep learning curve; you go to Scrapy for a complete web scraping solution.
Coming back to the greed for data: the idea is never about collecting the data, it's about how you structure it and use it.
That's all, folks, for the day. See you on the other side of the break 🙂 Happy learning!!