In this tutorial, we will learn how to scrape websites using Python and the Selenium module. The same script and technique can be adapted to many other websites, and it works for the photo pages on unsplash.com.
In the following section we will write a Python script that scrapes the download links of the first 10 photos from a given category on Unsplash and stores them in a text file.
Scraping a web page involves two steps: fetching the page and extracting data from it. Fetching is the downloading of a page (which is what a browser does when you view the page), so web crawling, which fetches pages for later processing, is a main component of web scraping. Once a page has been fetched, extraction can take place. The content of the page may be parsed, searched, reformatted, copied into a spreadsheet, and so on. Web scrapers typically take something out of a page in order to use it for another purpose somewhere else. An example would be finding names and phone numbers, or companies and their URLs, and copying them to a list (contact scraping).
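To make the two steps concrete, here is a minimal sketch that keeps fetching and extraction apart. It uses only the standard library rather than Selenium, and the names `fetch`, `LinkCollector` and `extract_links`, as well as the sample markup, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Fetching: download the raw HTML of a page as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Extraction: collect (link text, URL) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical page content, used here instead of a live download.
sample = '<ul><li><a href="https://example.com">Example Co</a></li></ul>'
print(extract_links(sample))
```

On a real page you would call `extract_links(fetch(url))`. Pages that build their content with JavaScript usually need a real browser, which is where Selenium comes in.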
If you don't have geckodriver yet, download it from the official repository: https://github.com/mozilla/geckodriver/releases
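Once geckodriver is downloaded, a quick way to check that your system can find it is to look it up on your PATH. This is only a small sketch: the helper name `find_driver` is made up, and it assumes you want the driver on PATH rather than passing its location to Selenium yourself. Recent Selenium releases may also be able to locate a driver on their own, in which case this check matters less.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found on PATH; download it from the releases page")
else:
    print("geckodriver found at", path)
```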
The folder structure should look like this:
Download the full configuration from my GitHub account.
Copy and run the following code:
    from selenium import webdriver

    browser = webdriver.Firefox()
    url = "https://unsplash.com/search/photos/mountains/"
    browser.get(url)
If you run into any errors, please comment below. I will be happy to help. 😁
If everything went well, you will see a Firefox window open and load the given URL.
Before we start, I would like you to go to the website and inspect the source code. You will notice something interesting: all download links have title="Download photo". We will use this to separate the download links from the other links. This will be our flow for developing the script.
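The title check described above can be tried out on a small piece of HTML before bringing Selenium into the picture. The sketch below uses the standard library's `html.parser`. The class name `DownloadLinkParser`, the function `download_links` and the sample markup are invented for illustration and only imitate the pattern, not Unsplash's actual page source.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Collect href values of <a> tags whose title is 'Download photo'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Keep the link only when the title matches exactly and an href exists.
        if attributes.get("title") == "Download photo" and attributes.get("href"):
            self.links.append(attributes["href"])


def download_links(html):
    parser = DownloadLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup that imitates the pattern described above.
sample = """
<a title="Download photo" href="https://example.com/photos/abc/download">Download</a>
<a title="View profile" href="https://example.com/@someone">Profile</a>
<a href="https://example.com/about">About</a>
"""
print(download_links(sample))
```

Only the first link is kept, because it is the only one whose title is exactly "Download photo".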
    from selenium import webdriver
    from selenium.webdriver.common.by import By


    def view_webpage(link_file):
        """Write the href of every 'Download photo' link to link_file."""
        try:
            # Newer Selenium releases dropped find_elements_by_tag_name
            anchors = browser.find_elements(By.TAG_NAME, "a")
        except Exception as exc:
            print("Some error occurred:", exc)
            return  # nothing to loop over, so stop here
        try:
            for anchor in anchors:
                if anchor.get_attribute("title") == "Download photo":
                    print(anchor.get_attribute("href"), file=link_file)
        except Exception:
            print("No data in element")


    browser = webdriver.Firefox()
    search_term = "mountains/"
    url = "https://unsplash.com/search/photos/" + search_term
    browser.get(url)

    complete = False
    # Open the file in append mode so earlier results are kept
    link_file = open("links.txt", mode="a+")
    while not complete:
        view_webpage(link_file)
        complete = True
    # Close the file so the links are saved to disk
    link_file.close()
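The script above writes every matching link it finds on the loaded page. If you want to stop at the first 10 links, as stated in the goal, one option is to cap the number of links while writing. The helper below is a sketch under that assumption: `save_first_links` is a made-up name, and the links are fake placeholders written to a temporary folder instead of coming from the browser.

```python
import os
import tempfile
from itertools import islice


def save_first_links(links, path, limit=10):
    """Append at most `limit` links to `path`, one per line; return the count."""
    count = 0
    # The with-block closes the file even if writing fails part-way.
    with open(path, mode="a", encoding="utf-8") as link_file:
        for link in islice(links, limit):
            print(link, file=link_file)
            count += 1
    return count


# Placeholder links standing in for the hrefs collected by Selenium.
fake_links = [f"https://example.com/photos/{i}/download" for i in range(25)]
target = os.path.join(tempfile.mkdtemp(), "links.txt")
print(save_first_links(fake_links, target))
```

In the real script you would pass in the hrefs collected from the page and use "links.txt" as the path.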
Voila!! It worked. Here are the links you will get in links.txt.
Stay tuned for my upcoming blog post at pyblog.in for an improved version of the script. The new script will let you download as many photos as you want and will support multi-threading.
If you get stuck anywhere, feel free to comment below. I will be happy to help. 😁
This blog post is for educational purposes only.