Build a LinkedIn Scraper Using Selenium and OpenAI’s GPT-4o-mini

Fateh Ali Aamir
5 min read · Aug 12, 2024


I wanted to create a LinkedIn scraper from scratch. I looked at various publicly available solutions, but none of them worked for me. I realized that any traditional solution that relied on locating elements on the web page and extracting the relevant data from them was highly prone to failure whenever the page structure changed. So I decided to throw GPT into the mix: I wouldn’t need to do all that nitty-gritty element extraction if I could simply grab “all” of the text on the page and ask the LLM to extract the information for me. After all, inference is one of its most powerful applications. So, after gaining basic knowledge of Selenium and studying some existing solutions, I built one of my own. Let’s dive into it below.

actions.py

import getpass
from . import constants as c
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


def login(driver, email=None, password=None, cookie=None, timeout=30):
    try:
        driver.get("https://www.linkedin.com/login")

        # Wait for the login form to load before interacting with it
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "username"))
        )

        email_elem = driver.find_element(By.ID, "username")
        email_elem.send_keys(email)

        password_elem = driver.find_element(By.ID, "password")
        password_elem.send_keys(password)
        password_elem.submit()

        # Wait for the global navigation bar, which confirms the login succeeded
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "global-nav__content"))
        )

    except Exception as e:
        print(f"Failed to log in: {e}")

We use driver.get() to open the LinkedIn login page, then use WebDriverWait() to wait for a specific element so we know the page has fully loaded. Once it has, we use driver.find_element() to grab the username and password fields by their IDs and use send_keys() to type our values into each text box. On the password element, we call submit() to log in. We then use WebDriverWait() again, this time checking for the global navigation bar that you see at the top of the page. If that appears, it means we’re in.
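For context, here is a minimal sketch of how this login() helper might be called on its own; the chromedriver path and the credentials are placeholders you would replace with your own values.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from linkedin_scraper import actions

# Placeholder path and credentials -- replace with your own
driver = webdriver.Chrome(service=Service("/usr/bin/chromedriver"))
actions.login(driver, email="you@example.com", password="your-password")

# At this point the session is authenticated and you can navigate to any profile
driver.get("https://www.linkedin.com/in/some-profile/")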

entity.py

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from .objects import Experience, Education, Scraper, Interest, Accomplishment, Contact
import os
from linkedin_scraper import selectors


class Entity(Scraper):

    __TOP_CARD = "top-card-background-hero-image"
    __WAIT_FOR_ELEMENT_TIMEOUT = 5

    def __init__(
        self,
        linkedin_url=None,
        driver=None,
        get=True,
        close_on_complete=True,
    ):
        self.driver = driver

We create a class called Entity with two class-level constants that are used later on. The constructor accepts a linkedin_url, a driver, get, and close_on_complete, and stores the driver that was passed in when the object was created.

    def scrape(self, close_on_complete=True):
        driver = self.driver

        try:
            # Wait for the page to load
            WebDriverWait(driver, self.__WAIT_FOR_ELEMENT_TIMEOUT).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )
            self.focus()
            self.wait(5)

            # Scroll to the bottom of the page to load all dynamic content
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            self.wait(3)  # wait for additional content to load

            # Get all the text on the page
            page_text = driver.find_element(By.TAG_NAME, "body").text

            print("Scraped Text:")
            print(page_text)

            return page_text

        except Exception as e:
            print(f"Failed to scrape the page: {e}")
            page_text = ""

        finally:
            if close_on_complete:
                driver.quit()

        return page_text

The scrape() method first waits for the body of the page to be present. It then uses driver.execute_script() to scroll to the bottom of the page so that all dynamic content loads. After that, it extracts all of the text on the page with driver.find_element(By.TAG_NAME, "body").text, prints it, and returns it. If close_on_complete is set, the browser is closed once scraping finishes.
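As a rough illustration, here is how the Entity class could be used on its own once you have a logged-in driver; the import assumes entity.py was added to the linkedin_scraper package, and the profile URL and credentials are placeholders.

from selenium import webdriver
from linkedin_scraper import actions
from linkedin_scraper.entity import Entity  # assumes entity.py was added to the package

driver = webdriver.Chrome()
actions.login(driver, email="you@example.com", password="your-password")

# Placeholder profile URL
profile_url = "https://www.linkedin.com/in/some-profile/"
driver.get(profile_url)

entity = Entity(linkedin_url=profile_url, driver=driver)
page_text = entity.scrape(close_on_complete=True)  # returns the raw page text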

linkedin_scraper.py

import os

from linkedin_scraper import Person, actions
from linkedin_scraper.entity import Entity  # assumes entity.py was added to the linkedin_scraper package
from openai import OpenAI
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def scrape_linkedin(linkedin_url, email, password):
    # Set up Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in headless mode
    chrome_options.add_argument("--disable-gpu")  # Disable GPU acceleration
    chrome_options.add_argument("--no-sandbox")  # Bypass OS security model
    chrome_options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
    chrome_options.add_argument("start-maximized")  # Start maximized
    chrome_options.add_argument("enable-automation")  # Enable automation controls
    chrome_options.add_argument("--disable-infobars")  # Disable infobars
    chrome_options.add_argument("--disable-extensions")  # Disable extensions
    chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")

    # Initialize the Chrome driver with the options
    service = Service('/usr/bin/chromedriver')  # Update with the correct path to your chromedriver
    driver = webdriver.Chrome(service=service, options=chrome_options)
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))  # Reads the key from the OPENAI_API_KEY environment variable

    if email and password:
        try:
            # Log in to LinkedIn
            actions.login(driver, email, password)

            # Wait for the LinkedIn homepage to load or for login to complete
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, ".global-nav__me-photo"))
            )

            # Navigate to the profile page
            driver.get(linkedin_url)

            # Wait for the profile page to load
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, ".pv-top-card__photo-wrapper.ml0"))
            )

            # Create an Entity object for the LinkedIn profile
            entity = Entity(linkedin_url=linkedin_url, driver=driver)

            # Scrape the LinkedIn profile data
            linkedin_data = entity.scrape(close_on_complete=True)  # Close browser after scraping

            prompt = """Extract and summarize the LinkedIn profile data into the following format:
            linkedin_data = {
                "name": person.name,
                "linkedin_url": person.linkedin_url,
                "about": person.about,
                "experiences": [str(exp) for exp in person.experiences],
                "educations": [str(edu) for edu in person.educations],
                "interests": [str(interest) for interest in person.interests],
                "accomplishments": [str(accomplishment) for accomplishment in person.accomplishments],
                "company": person.company,
                "job_title": person.job_title
            }
            """

            response = client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0.1,
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": f"parse the following data: {linkedin_data}"}
                ]
            )

            print(response.choices[0].message.content)
            return response.choices[0].message.content

        except Exception as e:
            print(f"Failed to scrape {linkedin_url}: {e}")
            return {}

        finally:
            driver.quit()
    else:
        return {}

In our main scrape_linkedin() function, we first set up everything we need: the chrome_options, the Service, the driver, and the OpenAI client. We then run the login() function with the driver, email, and password, and use WebDriverWait() to confirm we have reached the home page by checking for the profile photo in the navigation bar. Then we use driver.get() to navigate to the profile we want to scrape.

Once we’re on that page, we check for the profile picture using WebDriverWait() again. If it appears, we initialize our Entity object and run its scrape() method to get all of the text from the web page. With that in hand, we send it to OpenAI through the Chat Completions API with gpt-4o-mini and have it extract the required fields from that big chunk of text. We then print and return the structured result.
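To tie it all together, here is a minimal sketch of how you might call scrape_linkedin(); the profile URL and credentials are placeholders, and the OPENAI_API_KEY environment variable is assumed to be set.

import os

# Placeholder values -- replace with a real profile URL and your own credentials
profile_url = "https://www.linkedin.com/in/some-profile/"
email = os.getenv("LINKEDIN_EMAIL", "you@example.com")
password = os.getenv("LINKEDIN_PASSWORD", "your-password")

result = scrape_linkedin(profile_url, email, password)
print(result)  # the structured summary produced by gpt-4o-mini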

This entire solution was inspired by the following GitHub repo: https://github.com/joeyism/linkedin_scraper. Do check it out.
