Email Scraper

Parse Email

access_amherst_algo.email_scraper.email_parser.connect_and_fetch_latest_email(app_password, subject_filter, mail_server='imap.gmail.com')

Connect to the email server and fetch the latest email matching a subject filter.

This function connects to the specified IMAP email server (default is Gmail), logs in using the provided app password, and searches for the most recent email with a subject matching the subject_filter. It returns the email message object of the latest matching email.

Parameters:
  • app_password (str) -- The app password used for logging into the email account.

  • subject_filter (str) -- The subject filter used to search for specific emails.

  • mail_server (str, optional, default "imap.gmail.com") -- The address of the IMAP email server.

Returns:

The latest email message matching the filter, or None if no matching email is found or login fails.

Return type:

email.message.Message or None

Examples

>>> msg = connect_and_fetch_latest_email("amherst_college_password", "Amherst College Daily Mammoth for Sunday, November 3, 2024")
>>> if msg:
...     print(msg["From"])
noreply@amherst.edu

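
The search-and-fetch step described above can be sketched with the standard library. This is a minimal illustration, not the package's code: the helper name fetch_latest_matching is hypothetical, and it assumes the caller passes an already logged-in imaplib.IMAP4_SSL connection, whereas the documented function opens the connection itself.

```python
import email
from email.message import Message
from typing import Optional


def fetch_latest_matching(conn, subject_filter: str) -> Optional[Message]:
    """Return the newest message whose subject matches, or None.

    `conn` is assumed to be a logged-in imaplib.IMAP4_SSL object (or any
    object exposing the same select/search/fetch methods).
    """
    conn.select("inbox")
    # The IMAP SUBJECT key matches a substring of the Subject header.
    status, data = conn.search(None, f'(SUBJECT "{subject_filter}")')
    if status != "OK" or not data or not data[0]:
        return None
    # Sequence numbers are returned in ascending order, so the last one
    # is normally the most recently delivered message.
    latest_id = data[0].split()[-1]
    status, parts = conn.fetch(latest_id, "(RFC822)")
    if status != "OK" or not parts or not isinstance(parts[0], tuple):
        return None
    # parts[0] is an (envelope, raw_bytes) tuple for the requested message.
    return email.message_from_bytes(parts[0][1])
```

A real connection would typically be created with imaplib.IMAP4_SSL(mail_server) followed by login(address, app_password). The account address is not part of the documented signature, so where it comes from is left open here.
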
access_amherst_algo.email_scraper.email_parser.extract_email_body(msg)

Extract the body of an email message.

This function returns the plain-text body of the given email message. For a multipart message, it iterates over the parts, finds the "text/plain" part, and decodes it. For a non-multipart message, it decodes the payload directly.

Parameters:

msg (email.message.Message) -- The email message object from which to extract the body.

Returns:

The decoded plain-text body of the email, or None if no text content is found.

Return type:

str or None

Examples

>>> email_body = extract_email_body(email_msg)
>>> print(email_body)
This is information about Amherst College events on Sunday, November 3, 2024.

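
The multipart handling described above can be sketched as follows. The helper name plain_text_body and the UTF-8 fallback charset are assumptions made for illustration; the package's own implementation may differ.

```python
from email.message import EmailMessage, Message
from typing import Optional


def plain_text_body(msg: Message) -> Optional[str]:
    """Return the first text/plain payload as a string, or None."""
    if msg.is_multipart():
        # walk() yields the container first, then each nested part.
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True)
                if payload is not None:
                    charset = part.get_content_charset() or "utf-8"
                    return payload.decode(charset, errors="replace")
        return None
    # Non-multipart: decode the single payload directly.
    payload = msg.get_payload(decode=True)
    if payload is None:
        return None
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")


# Build a small multipart/alternative message to try it on.
demo = EmailMessage()
demo.set_content("Events on Sunday, November 3, 2024.")
demo.add_alternative("<p>Events on Sunday, November 3, 2024.</p>", subtype="html")
print(plain_text_body(demo))
```
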
access_amherst_algo.email_scraper.email_parser.extract_event_info_using_llama(email_content)

Extract event info from the email content using the LLaMA API.

This function sends the provided email content to the LLaMA API together with an instruction to extract event details. If the API response is valid, the function parses it and returns the extracted event information as a list of event JSON objects.

Parameters:

email_content (str) -- The raw content of the email to be processed by the LLaMA API.

Returns:

A list of event data extracted from the email content in JSON format. If extraction fails, an empty list is returned.

Return type:

list

Examples

>>> events = extract_event_info_using_llama("We're hosting a Literature Speaker Event this Tuesday, November 5, 2024 in Keefe Campus Center!")
>>> print(events)
[{'title': 'Literature Speaker Event', 'date': '2024-11-05', 'location': 'Keefe Campus Center'}]

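
The "parse if valid, otherwise return an empty list" behaviour can be sketched on its own, without calling any API. The helper name events_from_response_text is hypothetical, and the sketch assumes the model's reply text is a bare JSON array, which the documentation above does not confirm.

```python
import json
from typing import Any, List


def events_from_response_text(text: str) -> List[Any]:
    """Parse a JSON array of events; return [] if the text is not one."""
    try:
        parsed = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or a non-string reply counts as a failed extraction.
        return []
    # Anything other than a list (e.g. an error object) is also a failure.
    return parsed if isinstance(parsed, list) else []


print(events_from_response_text('[{"title": "Literature Speaker Event"}]'))
print(events_from_response_text("Sorry, I could not find any events."))
```
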
access_amherst_algo.email_scraper.email_parser.parse_email(subject_filter)

Parse the email and extract event data.

This function connects to an email account, fetches the latest email based on the provided subject filter, extracts event information from the email body using the LLaMA API, and saves the extracted events to a JSON file. The file is saved with a timestamped filename in the 'json_outputs' directory.

Parameters:

subject_filter (str) -- The subject filter to identify the relevant email to fetch.

Returns:

This function does not return any value but logs status messages for each stage of the process.

Return type:

None

Examples

>>> parse_email("Amherst College Daily Mammoth for Sunday, November 3, 2024")
Email fetched successfully.
Events saved successfully to extracted_events_20241103_150000.json.

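
The timestamped filename can be built with strftime. This sketch infers the extracted_events_YYYYMMDD_HHMMSS.json pattern from the example filenames in this documentation; the helper name timestamped_filename is hypothetical.

```python
from datetime import datetime
from typing import Optional


def timestamped_filename(now: Optional[datetime] = None) -> str:
    """Build a filename such as extracted_events_20241103_150000.json."""
    now = now or datetime.now()
    return f"extracted_events_{now.strftime('%Y%m%d_%H%M%S')}.json"


print(timestamped_filename(datetime(2024, 11, 3, 15, 0, 0)))
```
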
access_amherst_algo.email_scraper.email_parser.save_to_json_file(data, filename, folder)

Save the extracted events to a JSON file.

This function checks if the specified folder exists, creates it if it does not, and saves the provided event data to a JSON file with the specified filename. The data is saved with indentation for readability and structure.

Parameters:
  • data (dict or list) -- The data to be saved in JSON format. Typically, this would be a list or dictionary containing event data.

  • filename (str) -- The name of the file where the data will be saved (e.g., 'extracted_events_20241103_124530.json').

  • folder (str) -- The folder where the JSON file will be stored (e.g., 'json_outputs').

Returns:

This function does not return any value but writes data to a JSON file.

Return type:

None

Examples

>>> events = [{"title": "Literature Speaker Event", "date": "2024-11-05", "location": "Keefe Campus Center"}]
>>> save_to_json_file(events, "extracted_events_20241105_150000.json", "json_outputs")
Data successfully saved to json_outputs/extracted_events_20241105_150000.json
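
A minimal sketch of the folder check and indented write, using only the standard library. The name write_json, the indent width of 4, and the returned path are assumptions for illustration; the documented function returns None.

```python
import json
import os


def write_json(data, filename: str, folder: str) -> str:
    """Write data as indented JSON to folder/filename and return the path."""
    # exist_ok=True makes this a no-op when the folder already exists.
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return path
```
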

Save Email

access_amherst_algo.email_scraper.email_saver.is_similar_event(event_data)

Check if a similar event exists using timezone-aware datetime comparison.

This function compares the event's start and end times with existing database records to determine if a similar event already exists. It also checks for title similarity using a string similarity ratio.

Parameters:

event_data (dict) -- A dictionary containing event details such as title, start time, and end time.

Returns:

True if a similar event exists, otherwise False.

Return type:

bool

Examples

>>> event_data = {
...     "title": "Literature Speaker Event",
...     "starttime": "2024-11-05T18:00:00",
...     "endtime": "2024-11-05T20:00:00",
... }
>>> is_similar_event(event_data)
False

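
The combination of time comparison and title similarity can be sketched with difflib. This is an illustration under stated assumptions: it compares against an in-memory list rather than database records, takes already-parsed timezone-aware datetimes rather than strings, and uses a hypothetical threshold of 0.8. The name looks_like_duplicate is not part of the package.

```python
from datetime import datetime, timedelta, timezone
from difflib import SequenceMatcher


def looks_like_duplicate(candidate, existing, threshold=0.8):
    """Return True if an existing event has the same times and a similar title."""
    for event in existing:
        # Aware datetimes compare by instant, so 18:00 at UTC-5 equals 23:00 UTC.
        same_times = (candidate["starttime"] == event["starttime"]
                      and candidate["endtime"] == event["endtime"])
        ratio = SequenceMatcher(
            None, candidate["title"].lower(), event["title"].lower()
        ).ratio()
        if same_times and ratio >= threshold:
            return True
    return False


utc = timezone.utc
est = timezone(timedelta(hours=-5))
stored = [{
    "title": "Literature Speaker Event",
    "starttime": datetime(2024, 11, 5, 23, 0, tzinfo=utc),
    "endtime": datetime(2024, 11, 6, 1, 0, tzinfo=utc),
}]
incoming = {
    "title": "Literature speaker event!",
    "starttime": datetime(2024, 11, 5, 18, 0, tzinfo=est),
    "endtime": datetime(2024, 11, 5, 20, 0, tzinfo=est),
}
print(looks_like_duplicate(incoming, stored))
```
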
access_amherst_algo.email_scraper.email_saver.load_json_file(folder_path)

Load the most recent JSON file from the specified folder.

This function scans the given directory for JSON files, identifies the most recently modified JSON file, and loads its contents. If no JSON files are found, it returns None.

Parameters:

folder_path (str) -- The directory path where JSON files are stored.

Returns:

The parsed JSON data if a file is found, otherwise None.

Return type:

dict or list or None

Examples

>>> data = load_json_file("json_outputs")
>>> if data:
...     print(data)
[{'title': 'Literature Speaker Event', 'date': '2024-11-05', 'location': 'Keefe Campus Center'}]

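
Selecting the most recently modified JSON file can be sketched with glob and os.path.getmtime. The helper name load_latest_json is hypothetical, and the sketch does not handle files that contain invalid JSON.

```python
import glob
import json
import os


def load_latest_json(folder_path: str):
    """Load the most recently modified *.json file, or return None."""
    paths = glob.glob(os.path.join(folder_path, "*.json"))
    if not paths:
        return None
    # The file with the largest modification time is the newest.
    latest = max(paths, key=os.path.getmtime)
    with open(latest, encoding="utf-8") as f:
        return json.load(f)
```
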
access_amherst_algo.email_scraper.email_saver.parse_datetime(date_str, pub_date=None)

Parse a date or time string into a timezone-aware UTC datetime, applying an EST (UTC-5) offset.

This function attempts to parse a given date string using multiple formats, including standard date, ISO 8601, and RFC formats. If only a time is provided, it combines it with pub_date (if available) to create a full datetime object. The resulting datetime is then converted to UTC.

Parameters:
  • date_str (str) -- The date or time string to be parsed.

  • pub_date (str or datetime, optional) -- A reference date to be used when parsing a time-only string.

Returns:

A timezone-aware datetime object converted to UTC, or None if parsing fails.

Return type:

datetime or None

Examples

>>> parse_datetime("2024-11-05T18:00:00")
datetime.datetime(2024, 11, 5, 13, 0, tzinfo=<UTC>)
>>> parse_datetime("18:00:00", "2024-11-05")
datetime.datetime(2024, 11, 5, 13, 0, tzinfo=<UTC>)

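
The multi-format fallback and the time-only case can be sketched as below. This sketch deliberately leaves out the timezone step, so it returns naive datetimes and its results will not match the UTC values in the examples above. The format lists and the name parse_naive are assumptions; RFC-style dates are not covered, and pub_date is accepted only as a string here.

```python
from datetime import datetime
from typing import Optional

# Hypothetical format lists; the real function may accept others.
DATETIME_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")
TIME_FORMATS = ("%H:%M:%S", "%H:%M")


def parse_naive(date_str: str, pub_date: Optional[str] = None) -> Optional[datetime]:
    """Try each full format, then fall back to time-only plus pub_date."""
    for fmt in DATETIME_FORMATS:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    if pub_date is not None:
        base = parse_naive(pub_date)
        if base is not None:
            for fmt in TIME_FORMATS:
                try:
                    time_part = datetime.strptime(date_str, fmt).time()
                except ValueError:
                    continue
                # Attach the parsed time to the reference date.
                return datetime.combine(base.date(), time_part)
    return None


print(parse_naive("2024-11-05T18:00:00"))
print(parse_naive("18:00:00", "2024-11-05"))
```
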
access_amherst_algo.email_scraper.email_saver.process_email_events()

Process and save events extracted from email JSON data.

This function loads the most recent JSON file containing extracted email event data, checks for duplicate events, and saves new events to the database.

Returns:

This function does not return any value but prints processing status messages.

Return type:

None

Examples

>>> process_email_events()
Skipping similar event: Literature Speaker Event
Successfully saved/updated event: New Music Festival

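
The load, de-duplicate, and save loop can be sketched with the duplicate check and the save step passed in as callables. This is an assumption made so the example runs without a database; the documented function takes no arguments and prints its messages rather than returning them. The name process_events is hypothetical.

```python
from typing import Callable, Iterable, List


def process_events(
    events: Iterable[dict],
    is_duplicate: Callable[[dict], bool],
    save: Callable[[dict], None],
) -> List[str]:
    """Save each non-duplicate event and return the status messages."""
    messages = []
    for event in events:
        title = event.get("title", "")
        if is_duplicate(event):
            messages.append(f"Skipping similar event: {title}")
            continue
        save(event)
        messages.append(f"Successfully saved/updated event: {title}")
    return messages


saved = []
log = process_events(
    [{"title": "Literature Speaker Event"}, {"title": "New Music Festival"}],
    is_duplicate=lambda e: e["title"] == "Literature Speaker Event",
    save=saved.append,
)
print("\n".join(log))
```
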
access_amherst_algo.email_scraper.email_saver.save_event_to_db(event_data)

Save an event to the database, allowing nullable start and end times.

This function processes event data by generating a unique ID, parsing date fields, and saving or updating the event in the database.

Parameters:

event_data (dict) -- A dictionary containing event details, including title, start and end times, location, categories, and other metadata.

Returns:

This function does not return any value but prints a success or failure message.

Return type:

None

Examples

>>> event_data = {
...     "title": "Literature Speaker Event",
...     "starttime": "2024-11-05T18:00:00",
...     "endtime": "2024-11-05T20:00:00",
...     "location": "Keefe Campus Center",
...     "categories": ["Lecture", "Workshop"]
... }
>>> save_event_to_db(event_data)
Successfully saved/updated event: Literature Speaker Event