Build a LinkedIn Scraper Using Selenium and OpenAI’s GPT-4o-mini

Fateh Ali Aamir
5 min read · Aug 12, 2024


I wanted to create a LinkedIn scraper from scratch. I looked at various publicly available solutions, but none of them worked for me. I realized that any traditional solution that relied on locating elements on the web page and extracting the relevant data from them was highly prone to failure whenever the page structure changed. So I decided to throw GPT into the mix: I wouldn’t need to do all that nitty-gritty element extraction if I could simply grab “all” of the text on the page and ask the LLM to extract the information for me. After all, inference is one of its most powerful applications. So, after gaining basic knowledge of Selenium and studying some existing solutions, I built one of my own. Let’s dive into it below.

actions.py

import getpass
from . import constants as c
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


def login(driver, email=None, password=None, cookie=None, timeout=30):
    try:
        driver.get("https://www.linkedin.com/login")

        # Wait for the login form to load before interacting with it
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "username"))
        )

        email_elem = driver.find_element(By.ID, "username")
        email_elem.send_keys(email)

        password_elem = driver.find_element(By.ID, "password")
        password_elem.send_keys(password)
        password_elem.submit()

        # Wait for the global navigation bar, which confirms the login succeeded
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "global-nav__content"))
        )

    except Exception as e:
        print(f"Failed to log in: {e}")

We use driver.get() to open the LinkedIn login page, then use WebDriverWait() to wait for a specific element so we know the page has fully loaded. Once it has, we use driver.find_element() to grab the username and password fields by their IDs and use send_keys() to type our values into each text box. On the password element, we call submit() to log in. We then use WebDriverWait() again, this time checking for the global navigation bar that you see at the top of the page. If that appears, it means we’re in.
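For context, here is a minimal sketch of how this login() helper might be called on its own; the chromedriver path and the credentials are placeholders you would replace with your own values.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from linkedin_scraper import actions

# Placeholder path and credentials -- replace with your own
driver = webdriver.Chrome(service=Service("/usr/bin/chromedriver"))
actions.login(driver, email="you@example.com", password="your-password")

# At this point the session is authenticated and you can navigate to any profile
driver.get("https://www.linkedin.com/in/some-profile/")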

entity.py

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from .objects import Experience, Education, Scraper, Interest, Accomplishment, Contact
import os
from linkedin_scraper import selectors


class Entity(Scraper):

    __TOP_CARD = "top-card-background-hero-image"
    __WAIT_FOR_ELEMENT_TIMEOUT = 5

    def __init__(
        self,
        linkedin_url=None,
        driver=None,
        get=True,
        close_on_complete=True,
    ):
        self.driver = driver

We create a class called Entity with two class-level constants that are used later on. The constructor accepts a linkedin_url, a driver, get, and close_on_complete, and stores the driver that was passed in when the object was created.

    def scrape(self, close_on_complete=True):
        driver = self.driver

        try:
            # Wait for the page to load
            WebDriverWait(driver, self.__WAIT_FOR_ELEMENT_TIMEOUT).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )
            self.focus()
            self.wait(5)

            # Scroll to the bottom of the page to load all dynamic content
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            self.wait(3)  # wait for additional content to load

            # Get all the text on the page
            page_text = driver.find_element(By.TAG_NAME, "body").text

            print("Scraped Text:")
            print(page_text)

            return page_text

        except Exception as e:
            print(f"Failed to scrape the page: {e}")
            page_text = ""

        finally:
            if close_on_complete:
                driver.quit()

        return page_text

The scrape() method first waits for the body of the page to be present. It then uses driver.execute_script() to scroll to the bottom of the page so that all dynamic content loads. After that, it extracts all of the text on the page with driver.find_element(By.TAG_NAME, "body").text, prints it, and returns it. If close_on_complete is set, the browser is closed once scraping finishes.
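As a rough illustration, here is how the Entity class could be used on its own once you have a logged-in driver; the import assumes entity.py was added to the linkedin_scraper package, and the profile URL and credentials are placeholders.

from selenium import webdriver
from linkedin_scraper import actions
from linkedin_scraper.entity import Entity  # assumes entity.py was added to the package

driver = webdriver.Chrome()
actions.login(driver, email="you@example.com", password="your-password")

# Placeholder profile URL
profile_url = "https://www.linkedin.com/in/some-profile/"
driver.get(profile_url)

entity = Entity(linkedin_url=profile_url, driver=driver)
page_text = entity.scrape(close_on_complete=True)  # returns the raw page text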

linkedin_scraper.py

import os

from linkedin_scraper import Person, actions
from linkedin_scraper.entity import Entity  # assumes entity.py was added to the linkedin_scraper package
from openai import OpenAI
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def scrape_linkedin(linkedin_url, email, password):
    # Set up Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in headless mode
    chrome_options.add_argument("--disable-gpu")  # Disable GPU acceleration
    chrome_options.add_argument("--no-sandbox")  # Bypass OS security model
    chrome_options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
    chrome_options.add_argument("start-maximized")  # Start maximized
    chrome_options.add_argument("enable-automation")  # Enable automation controls
    chrome_options.add_argument("--disable-infobars")  # Disable infobars
    chrome_options.add_argument("--disable-extensions")  # Disable extensions
    chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")

    # Initialize the Chrome driver with the options
    service = Service('/usr/bin/chromedriver')  # Update with the correct path to your chromedriver
    driver = webdriver.Chrome(service=service, options=chrome_options)
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))  # Reads the key from the OPENAI_API_KEY environment variable

    if email and password:
        try:
            # Log in to LinkedIn
            actions.login(driver, email, password)

            # Wait for the LinkedIn homepage to load or for login to complete
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, ".global-nav__me-photo"))
            )

            # Navigate to the profile page
            driver.get(linkedin_url)

            # Wait for the profile page to load
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, ".pv-top-card__photo-wrapper.ml0"))
            )

            # Create an Entity object for the LinkedIn profile
            entity = Entity(linkedin_url=linkedin_url, driver=driver)

            # Scrape the LinkedIn profile data
            linkedin_data = entity.scrape(close_on_complete=True)  # Close browser after scraping

            prompt = """Extract and summarize the LinkedIn profile data into the following format:
            linkedin_data = {
                "name": person.name,
                "linkedin_url": person.linkedin_url,
                "about": person.about,
                "experiences": [str(exp) for exp in person.experiences],
                "educations": [str(edu) for edu in person.educations],
                "interests": [str(interest) for interest in person.interests],
                "accomplishments": [str(accomplishment) for accomplishment in person.accomplishments],
                "company": person.company,
                "job_title": person.job_title
            }
            """

            response = client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0.1,
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": f"parse the following data: {linkedin_data}"}
                ]
            )

            print(response.choices[0].message.content)
            return response.choices[0].message.content

        except Exception as e:
            print(f"Failed to scrape {linkedin_url}: {e}")
            return {}

        finally:
            driver.quit()
    else:
        return {}

In our main scrape_linkedin() function, we first set up everything we need: the chrome_options, the Service, the driver, and the OpenAI client. We then run the login() function with the driver, email, and password, and use WebDriverWait() to confirm we have reached the home page by checking for the profile photo in the navigation bar. Then we use driver.get() to navigate to the profile we want to scrape.

Once we’re on that page, we check for the profile picture using WebDriverWait() again. If it appears, we initialize our Entity object and run its scrape() method to get all of the text from the web page. With that in hand, we send it to OpenAI through the Chat Completions API with gpt-4o-mini and have it extract the required fields from that big chunk of text. We then print and return the structured result.
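To tie it all together, here is a minimal sketch of how you might call scrape_linkedin(); the profile URL and credentials are placeholders, and the OPENAI_API_KEY environment variable is assumed to be set.

import os

# Placeholder values -- replace with a real profile URL and your own credentials
profile_url = "https://www.linkedin.com/in/some-profile/"
email = os.getenv("LINKEDIN_EMAIL", "you@example.com")
password = os.getenv("LINKEDIN_PASSWORD", "your-password")

result = scrape_linkedin(profile_url, email, password)
print(result)  # the structured summary produced by gpt-4o-mini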

This entire solution was inspired by the following GitHub repo: https://github.com/joeyism/linkedin_scraper. Do check it out.
