Web Scraping using Python

A couple of simple examples of how to extract data online.

Web scraping is a useful tool for a wide range of problems. For example, it can help you build a new dataset, get high-frequency information on financial variables, adopt a pet (see link), or even look for a new job.

When I was applying for RA positions, I wrote a script that emailed me, on a daily basis, any new RA openings posted on the NBER website.

Here is a set of slides that walks through this example, plus the same example on a more complex website using Selenium.
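
As a rough illustration of the basic idea, the sketch below parses a small hard-coded HTML snippet with Python's built-in `html.parser` and collects job titles and links. The markup, the `job` class name, and the `JobParser` class are all hypothetical and do not reflect the NBER page's actual structure. A real scraper would first download the page and would probably rely on a dedicated parsing library.

```python
from html.parser import HTMLParser

# Hypothetical markup; a real page will be structured differently.
SAMPLE_HTML = """
<ul>
  <li><a class="job" href="/jobs/101">RA in Labor Economics</a></li>
  <li><a class="job" href="/jobs/102">RA in Public Finance</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""


class JobParser(HTMLParser):
    """Collect title/link pairs from <a class="job"> elements."""

    def __init__(self):
        super().__init__()
        self.jobs = []              # list of dicts with 'title' and 'link'
        self._current_link = None   # href of the job link being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "job":
            self._current_link = attrs.get("href")

    def handle_data(self, data):
        # Text found inside a job link is treated as the position title.
        if self._current_link is not None and data.strip():
            self.jobs.append({"title": data.strip(), "link": self._current_link})

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_link = None


parser = JobParser()
parser.feed(SAMPLE_HTML)
for job in parser.jobs:
    print(job["title"], job["link"])
```

The link without the `job` class is skipped, which is the kind of filtering a scraper typically needs to separate listings from navigation links.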

Finally, to email updates on the NBER openings, I proceed as follows.

  First, load the email-specific Python packages:

    import smtplib
    from email.mime.text import MIMEText
    from email.mime.multipart import MIMEMultipart

  1. Check whether the stored dataset of job openings exists, then open it as a data frame:

    # assumes pandas was imported earlier as pd; df_old holds the previously saved openings
    df_old = pd.read_csv('current_jobs.csv', sep=';', index_col=False)
  2. Look for differences between the old and new data.

    diffs = list(set(df.id) ^ set(df_old.id)) # id: job opening id
  3. If there are differences, create lists for added/gone openings.

    # compute sets of added and removed jobs, create corresponding dataframes (email)
    added    = list(set(df.id) - set(df_old.id))
    gone     = list(set(df_old.id) - set(df.id))
    dfAdded  = df[df['id'].isin(added)]
    dfGone   = df_old[df_old['id'].isin(gone)]
  4. Email new openings and filled positions:

    # send new jobs

    # subject
    if len(added) > 0:
        subject = 'New Opening Alert!'
    else:
        subject = 'Closed positions.'

    # sender
    sender = 'your_email@gmail.com'
    password = 'your_password'
    smtp_server = "smtp.gmail.com"
    port = 587

    # recipient
    recipient  = 'your_email@gmail.com'

    msg = MIMEMultipart()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject  

    message = f"""\
    Check changes in RA openings below
    {dfAdded[['pos_name', 'researcher', 'institution', 'field', 'link']].stack().to_string(index=False)}
    {dfGone[['pos_name', 'researcher', 'institution', 'field', 'link']].stack().to_string(index=False)}
    """
    msg.attach(MIMEText(message, 'plain'))  # attach the body; otherwise the email is sent empty

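    smtpObj = smtplib.SMTP(smtp_server, port)
    smtpObj.starttls()  # port 587 expects the connection to be upgraded to TLS before login
    smtpObj.login(sender, password)
    smtpObj.sendmail(sender, recipient, msg.as_string())
    smtpObj.quit()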

    fileName = 'current_jobs.csv'
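
The comparison and email-building steps above can be tried out without a mail server. What follows is a minimal sketch under stated assumptions: the job ids and titles are made up, `split_changes` and `build_alert` are hypothetical helper names, and plain dictionaries stand in for the data frames. It builds the message but does not send it.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Made-up data: job id -> position name (stand-ins for df_old and df).
old_jobs = {101: "RA in Labor Economics", 102: "RA in Public Finance"}
new_jobs = {102: "RA in Public Finance", 103: "RA in Macroeconomics"}


def split_changes(new_ids, old_ids):
    """Return (added, gone) as sorted lists of ids."""
    new_ids, old_ids = set(new_ids), set(old_ids)
    added = sorted(new_ids - old_ids)   # present in the new data only
    gone = sorted(old_ids - new_ids)    # present in the old data only
    return added, gone


def build_alert(added, gone, new, old, sender, recipient):
    """Build (but do not send) an email summarising the changes."""
    msg = MIMEMultipart()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "New Opening Alert!" if added else "Closed positions."
    lines = ["Check changes in RA openings below", ""]
    lines += [f"Added: {new[i]}" for i in added]
    lines += [f"Gone: {old[i]}" for i in gone]
    msg.attach(MIMEText("\n".join(lines), "plain"))
    return msg


added, gone = split_changes(new_jobs, old_jobs)
msg = build_alert(added, gone, new_jobs, old_jobs,
                  "me@example.com", "me@example.com")
print(added, gone)
print(msg["Subject"])
```

Keeping the message construction separate from the sending step makes it possible to inspect the subject and body before any credentials are involved.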
