Selenium Tutorial: Scraping Glassdoor.com in 10 Minutes

Omer Sakarya
7 min read · Oct 14, 2019


I scraped jobs data from Glassdoor.com for a project. Let me tell you how I did it…

What is Scraping?

It is a method for programmatically collecting information from web pages.

Why Scraping?

Other than the fact that it is fun, Glassdoor’s API library provides only a limited number of data points. It doesn’t let you pull jobs or reviews; you only get companies, which was useless in my case. In this guide, I will share my way of doing it, along with the Jupyter Notebook.

Why Selenium?

Good question! Glassdoor renders its content with JavaScript, which means that a simple GET request to a job listing page would return only part of the content. We are interested in more than that.

There are data points such as company valuation and job location under the “Company” tab, and we want to access that information as well. The page does not render that content until the user clicks on the “Company” tab, so clicking on it is necessary. Using the requests library and doing simple GET requests would not work for this type of website. Therefore, the only way to scrape that data is to write a program that mimics a human user. Selenium is a library that lets you write a Python script that drives a real browser, acting just like a human user would.
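To make this concrete, here is a minimal sketch of what “acting like a human user” looks like in Selenium. The search URL and the tab selector are illustrative assumptions, not necessarily Glassdoor’s actual markup:

# A minimal sketch of the idea: click the "Company" tab so its content gets
# rendered, then it can be scraped. The URL and tab selector below are
# illustrative assumptions; the real ones on Glassdoor may differ.
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get('https://www.glassdoor.com/Job/data-scientist-jobs-SRCH_KO0,14.htm')  # example search URL

company_tab = driver.find_element_by_xpath('.//div[@data-tab-type="overview"]')  # hypothetical selector
company_tab.click()  # after this click, the "Company" details are in the page and can be read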

What Will We Build?

Essentially, we will build a Python script that gives us a DataFrame like this:

Job listings DataFrame, scraped from Glassdoor.com

The Python script will do exactly the following: search for a given keyword, click on each job in the listings all the way down, click through the different tabs in the job description panel, and scrape all the data. When it reaches the end of the list, it will go to the next page of results and keep doing the same thing until a target number of jobs has been scraped. The end result drives Google Chrome exactly like a human would; everything is automated, with no human interaction involved. A high-level sketch of this main loop is shown below.
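Here is a rough sketch of that loop, assuming a driver is already running. The listing class name, the next-page selector, and the scrape_current_job helper are placeholders for illustration, not the notebook’s actual code:

# High-level sketch of the main loop described above (not the notebook's exact
# code). The listing class name, the "next page" selector, and the
# scrape_current_job helper are illustrative placeholders.
num_jobs = 250          # target number of jobs to collect
jobs = []

while len(jobs) < num_jobs:
    job_cards = driver.find_elements_by_class_name('jl')              # assumed class of a job listing card
    for card in job_cards:
        card.click()                                                   # open the job description panel
        jobs.append(scrape_current_job(driver))                        # hypothetical helper: reads every tab
        if len(jobs) >= num_jobs:
            break
    driver.find_element_by_xpath('.//li[@class="next"]//a').click()   # assumed "next page" button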

Pre-requisites

  • Working knowledge of Python
  • Python version 3.x
  • Jupyter Notebook installed
  • Very basic knowledge of HTML
  • Some knowledge of XPath would also go a long way.
  • Selenium library installed
  • ChromeDriver placed in a directory you know

The Core Principle of Web Scraping

It is quite simple. Let’s say you have your eye on some web page element (such as a piece of text) and would like to scrape it. Elements on a web page reside in a hierarchy. You just need to tell your code the “address” of the element within that hierarchy.

Let’s take the headquarters information of a company as an element of interest. In the picture below, the headquarters of this company is San Francisco, CA, so we should scrape “San Francisco, CA”. All you need to do is right-click and choose “Inspect Element”. That will open a small window with the HTML content of the web page, with your element highlighted for you. You can also see where the element is placed in the hierarchy.

You cannot see the rest of the hierarchy in the picture, but it does not really matter, because we will use a relative address to describe where the element is. We are going to say: look for a div with a class of “infoEntity” that has a label as a child whose text is “Headquarters”, and take the text inside the following sibling of that label. We just need to express this in a formal way, and the XPath below does exactly that. We do not have to spell out the whole hierarchy because, chances are, this description fits only this one HTML element.

# Also putting the code in text form in case you'd like to copy and paste.
from selenium.common.exceptions import NoSuchElementException

try:
    headquarters = driver.find_element_by_xpath('.//div[@class="infoEntity"]//label[text()="Headquarters"]//following-sibling::*').text
except NoSuchElementException:
    headquarters = -1

If you understand what we’re doing here, you can easily scrape almost any webpage you’d like. We are simply expressing the long, awkward sentence from the paragraph above in a formal way. Feel free to take a deeper look into XPath if you’d like.

Also, notice that we have put the statement inside a try-except clause. This is good practice when scraping web elements, since they might not exist at all. In Glassdoor’s case, not all fields are always available; a company might not have listed its headquarters on Glassdoor. In that case, you might want to assign a “not found” value to the data field. I have chosen -1 for that purpose.

The Code

You can download the Jupyter notebook here. If you were able to follow what we did above and have a basic knowledge of Python, you should be able to read the code with its comments. For the rest of this post, I will explain the little tricks in the code whose purpose is not so obvious.

Note that you need to change the ChromeDriver path so that it points to the chromedriver executable on your local file system, as in the sketch below.
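For reference, a minimal setup might look like the following, using the Selenium 3-style API this post is based on; the path shown is just an example:

# Point Selenium at the local ChromeDriver (Selenium 3-style API, as used in
# this post). Change the path to wherever you placed chromedriver.
from selenium import webdriver

PATH_TO_CHROMEDRIVER = '/Users/you/Downloads/chromedriver'   # change this to your own path
driver = webdriver.Chrome(executable_path=PATH_TO_CHROMEDRIVER)
driver.set_window_size(1120, 1000)   # a reasonably large window so elements stay clickable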

Bypassing the Sign-Up Prompt


One challenge in scraping Glassdoor is that if you have not signed in (which in our case you won’t), then as soon as you click anywhere on the page, it asks you to sign up. A new modal window comes up and blocks Selenium from clicking anywhere else. This happens every time Selenium does a new search or goes to the next page of job listings. The way we deal with this is to click the X button to close the prompt right after clicking on an arbitrary job posting on the website. In the code, the arbitrary job is the topmost job listing; it comes selected by default, so that element’s class name is “selected”, and we can find and click on it with find_element_by_class_name("selected").
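Here is a sketch of that workaround, assuming a driver already on a results page. The class name “selected” comes from the article; the close-button selector is a hypothetical example you should verify with Inspect Element:

# Sketch of the popup workaround: click the already-selected top job, then close
# the sign-up modal. The close-button selector is an assumption; inspect the
# modal on the live site to confirm it.
import time
from selenium.common.exceptions import (NoSuchElementException,
                                        ElementClickInterceptedException)

try:
    driver.find_element_by_class_name('selected').click()   # topmost job, selected by default
except ElementClickInterceptedException:
    pass                                                     # the prompt got in the way; we close it next

time.sleep(0.5)   # give the sign-up modal a moment to appear

try:
    driver.find_element_by_css_selector('[alt="Close"]').click()   # hypothetical "X" button selector
except NoSuchElementException:
    pass                                                     # no prompt this time, keep going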

How About Glassdoor API?

As of today, Glassdoor does not have any public API for jobs, which means that you have to do scraping if you want data about job postings. I heard you saying thank you. You are welcome.

Also, Glassdoor does not have an API for reviews either, which might be of interest to you. I would recommend this tutorial in case you would like to obtain reviews.

Getting rejected or throttled by Glassdoor

One other point about Glassdoor’s website: if you get too aggressive with your GET requests, Glassdoor will start rejecting or throttling your connections. In other words, if you were to send requests to many links at once, it is likely that your IP address would get blocked, or your connection would slow down. That leaves us no option other than writing a script that browses the website like a human would, clicking through job listings without raising too many eyebrows on Glassdoor’s servers.
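One simple way to keep the request pattern human-like is to pause a little (with some randomness) between actions. The helper below is an illustrative sketch with arbitrary values, not tuned advice:

# A light-touch pacing helper: wait a small, slightly randomized amount of time
# between actions so the request pattern looks less bot-like. The values are
# illustrative, not tuned recommendations.
import random
import time

def polite_pause(base_seconds=1.0, jitter_seconds=0.5):
    """Sleep for base_seconds plus a random amount of jitter."""
    time.sleep(base_seconds + random.uniform(0, jitter_seconds))

# Example: call polite_pause() after each click or page load in the scraping loop.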

It is annoying that a new Chrome window opens every time I run the script.

I totally hear you on that. You can also let Selenium do all the scraping in the background, without opening a browser window, although it might get tricky. This way of scraping is called “headless”. You need to add the line below to your Chrome options before the driver is created, and your script should do the scraping in the background.

options.add_argument('headless')
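Put together, a fuller headless setup might look like this, assuming the Selenium 3-style API used throughout this post; the chromedriver path and URL are placeholders:

# Fuller sketch of the headless setup, assuming the Selenium 3-style API used in
# this post. The options object has to be built before the driver is created.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('headless')          # run Chrome without opening a visible window
driver = webdriver.Chrome(executable_path='/path/to/chromedriver', options=options)
driver.get('https://www.glassdoor.com')   # example URL; the rest of the scraping works as before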

Pros

There is only a small chance of getting caught by Glassdoor and having your HTTP requests rejected or throttled.

You get more data fields than Glassdoor’s API offers. In fact, you can get even more if you extend the code to scrape the specific data points you are looking for.

Cons

This method is suitable if you are looking to scrape a couple of hundred jobs. For thousands, you might have to leave your laptop running overnight or have multiple Jupyter notebooks scrape simultaneously, which gets messier since it increases the density of HTTP requests to Glassdoor, which in turn increases the chance of your IP address being blocked.

THE END 🎉 🎊

Hope you enjoyed reading, cheers! 😊 👏
