Web Scraping using Python
A couple of simple examples of how to extract data online.
Web scraping can be a very useful tool for a wide range of problems. For example, it can help you build a new dataset, get high-frequency information on financial variables, adopt a pet (see link), or even look for a new job.
When I was applying for RA positions, I wrote a script that emailed me, on a daily basis, the new RA openings posted on the NBER website.
Here is a set of slides that walks through this example, plus the same example on a more complex website using Selenium.
Finally, to email updates on the NBER openings, I proceed as follows.
- Load the email-specific Python packages:
 
    import smtplib
    from email.mime.text import MIMEText
    from email.mime.multipart import MIMEMultipart
- Check that the stored dataset of job openings exists, then open it as a data frame:

    df_old = pd.read_csv('current_jobs.csv', sep=';', index_col=False)

- Look for differences between the old and the new data:

    diffs = list(set(df.id) ^ set(df_old.id))  # id: job opening id

- If there are differences, create lists of the added and removed openings:

    # compute sets of added and removed jobs, create corresponding dataframes (email)
    added = list(set(df.id) - set(df_old.id))
    gone = list(set(df_old.id) - set(df.id))
    dfAdded = df[df['id'].isin(added)]
    dfGone = df_old[df_old['id'].isin(gone)]

- Email new openings and filled positions:
 
    # send new jobs
    # subject
    if len(added) > 0:
        subject = 'New Opening Alert!'
    else:
        subject = 'Closed positions.'
    # sender
    sender = 'your_email@gmail.com'
    password = 'your_password'
    smtp_server = "smtp.gmail.com"
    port = 587
    # recipient
    recipient = 'your_email@gmail.com'
    msg = MIMEMultipart()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject  
    message = f"""\
    Check changes in RA openings below 
    Opened:
    {dfAdded[['pos_name', 'researcher', 'institution', 'field', 'link']].stack().to_string(index=False)}
    Closed:
    {dfGone[['pos_name', 'researcher', 'institution', 'field', 'link']].stack().to_string(index=False)}
    """
    msg.attach(MIMEText(message,'plain')) 
    smtpObj = smtplib.SMTP(smtp_server, port)
    smtpObj.ehlo()
    smtpObj.starttls()
    smtpObj.login(sender, password)
    smtpObj.sendmail(sender, recipient, msg.as_string())
    smtpObj.quit()
    # store the current openings so the next run can compare against them
    fileName = 'current_jobs.csv'
    df.to_csv(fileName, na_rep='.', sep=';', index=False)
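
The steps above can also be wrapped into a few small functions. The following is only a minimal sketch, not the original script: it uses plain Python lists instead of data frames so that it runs without pandas, and the names `split_changes`, `build_message`, `send_message`, and the `SMTP_PASSWORD` environment variable are hypothetical choices made for illustration. Unlike the snippet above, it skips the email entirely when nothing changed and reads the password from the environment rather than hard-coding it.

```python
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def split_changes(new_ids, old_ids):
    """Return (added, gone) as sorted lists of job ids."""
    new_ids, old_ids = set(new_ids), set(old_ids)
    return sorted(new_ids - old_ids), sorted(old_ids - new_ids)


def build_message(sender, recipient, added_rows, gone_rows):
    """Build the alert email, or return None when nothing changed."""
    if not added_rows and not gone_rows:
        return None
    msg = MIMEMultipart()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "New Opening Alert!" if added_rows else "Closed positions."
    body = "\n".join(
        ["Check changes in RA openings below", "Opened:", *added_rows,
         "Closed:", *gone_rows]
    )
    msg.attach(MIMEText(body, "plain"))
    return msg


def send_message(msg, smtp_server="smtp.gmail.com", port=587):
    """Send msg over STARTTLS; the password is read from the environment."""
    password = os.environ["SMTP_PASSWORD"]  # hypothetical variable name
    with smtplib.SMTP(smtp_server, port) as server:
        server.ehlo()
        server.starttls()
        server.login(msg["From"], password)
        server.send_message(msg)


# Made-up ids and rows, for illustration only.
added, gone = split_changes(new_ids=[2, 3, 4], old_ids=[1, 2, 3])
alert = build_message(
    "your_email@gmail.com",
    "your_email@gmail.com",
    added_rows=[f"job {i}" for i in added],
    gone_rows=[f"job {i}" for i in gone],
)
# send_message(alert)  # uncomment once SMTP_PASSWORD is set
```

With data frames, the `added_rows` and `gone_rows` arguments could be filled from `dfAdded` and `dfGone`, for instance by formatting each row as one line of text.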