Lambda function to send an email | Fabrizio Napolitano

Sometimes there are actions which are repetitive and can be automated. For example, looking for job openings in a specific field can be very time consuming. In my country, all ongoing public selections (at labs or universities) are published on a specific website, and people check it twice per week to see if there is something new. Unfortunately, the website has no filtering, so you have to scroll through the full list of selections to see if anything is relevant for you.

Relevant xkcd.

In this post, I will explain how to use an AWS Lambda function to send an email to yourself after parsing a PDF published on a website. For those who are not familiar with AWS, it is a cloud service provider, and Lambda is its serverless computing service: it lets you run code without provisioning or managing servers.

These are the relevant steps:

Get an AWS account; it will be free for the first months
Create a Lambda function. You can use the AWS console or the AWS CLI; I used the console.

For the Lambda function, the best approach is to create a virtual environment with the libraries you need, then zip it and upload it to the Lambda function. This is how it goes:

mkdir my_lambda
cd my_lambda
python3 -m venv my_lambda
source my_lambda/bin/activate
pip install requests PyPDF2 # the two libraries I need
cd my_lambda/lib/python3.11/site-packages # adjust python3.11 to your Python version

Now that you have the libraries, you can add the lambda_function.py file in this folder; it contains the code you want to run. In my case, I want to parse a PDF and send an email if there is a new selection.

import json
import requests
import PyPDF2 # for the parsing of the pdf
import boto3 # for the email service
import datetime
import io

ses = boto3.client('ses') # simple email service, needs further configuration on the AWS console

# function to send an email
def sendEmail(subject='testmail', body='testmail'):
    ses.send_email(
        Source='your@email.com',  # must be an identity verified in SES
        Destination={'ToAddresses': ['your@email.com']},
        Message={
            'Subject': {'Data': subject, 'Charset': 'UTF-8'},
            'Body': {'Text': {'Data': body, 'Charset': 'UTF-8'}}
        }
    )

# let's set up some useful functions
def download_pdf(pdf_url):
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()  # fail early if the PDF is not available at this URL
    return io.BytesIO(response.content)  # Return a BytesIO object to avoid saving to disk

def extract_text_from_pdf(pdf_bytes):
    reader = PyPDF2.PdfReader(pdf_bytes)
    text = ''
    for page in reader.pages:
        text += page.extract_text() or ''  # guard against pages with no extractable text
    return text
    
# I need this because the results are sent by email, so they have to be formatted as a single string
def format_search_results(search_results):
    result_str = ""
    for keyword, details in search_results.items():
        # Add the keyword and occurrences
        result_str += f"Keyword: '{keyword}', Occurrences: {details['count']}\n"
        result_str += "Snippets:\n"
        
        # Add each snippet with a new line
        for snippet in details['snippets']:
            result_str += f"- {snippet}\n"
        
        # Add a newline for better separation between keywords
        result_str += "\n"
    
    return result_str

# Fixed-date public holidays in Italy. On these dates, the Gazzetta Ufficiale is not published.
def is_fixed_holiday(date_to_check):
    year = date_to_check.year
    fixed_holidays = {
        datetime.date(year, 1, 1),   # New Year's Day
        datetime.date(year, 1, 6),   # Epiphany
        datetime.date(year, 4, 25),  # Liberation Day
        datetime.date(year, 5, 1),   # Labor Day
        datetime.date(year, 6, 2),   # Republic Day
        datetime.date(year, 8, 15),  # Assumption Day (Ferragosto)
        datetime.date(year, 11, 1),  # All Saints' Day
        datetime.date(year, 12, 8),  # Feast of the Immaculate Conception
        datetime.date(year, 12, 25), # Christmas Day
        datetime.date(year, 12, 26)  # St. Stephen's Day
    }
    
    # Check if the date is in the set of fixed holidays
    return date_to_check in fixed_holidays

# the function to count the number of tuesdays and fridays from the beginning of the year
# this is needed because the gazzetta number is the sum of the two
def count_fridays_and_tuesdays(start_date, end_date):
    # Initialize counters
    friday_count = 0
    tuesday_count = 0

    # Iterate over each date between start_date and end_date
    current_date = start_date
    while current_date <= end_date:
        if is_fixed_holiday(current_date): # Skip fixed holidays
            pass
        elif current_date.weekday() == 1:  # Tuesday (weekday() == 1)
            tuesday_count += 1
        elif current_date.weekday() == 4:  # Friday (weekday() == 4)
            friday_count += 1
        
        # Move to the next day
        current_date += datetime.timedelta(days=1)

    return friday_count, tuesday_count

# function to search for keywords in the text
def search_keywords(text, keywords, snippet_radius=50):
    # Convert the entire text to lowercase for case-insensitive search
    lower_text = text.lower()
    
    # Store results for each keyword
    results = {}

    for keyword in keywords:
        # Initialize list to hold snippets for this keyword
        snippets = []
        
        # Convert keyword to lowercase for case-insensitive search
        lower_keyword = keyword.lower()

        # Find occurrences of the keyword in the text
        start = 0
        while True:
            # Find the next occurrence of the keyword
            index = lower_text.find(lower_keyword, start)
            if index == -1:
                break  # No more occurrences

            # Determine snippet boundaries
            start_index = max(index - snippet_radius, 0)
            end_index = min(index + len(keyword) + snippet_radius, len(text))

            # Extract the snippet from the original text (not the lowercase version)
            snippet = (
                text[start_index:index] + "[" + text[index:index + len(keyword)] + "]" + text[index + len(keyword):end_index]
            )
            
            snippets.append(snippet)

            # Update start position to continue searching after the current keyword
            start = index + len(keyword)

        # Store the snippets and the total count of occurrences for each keyword
        results[keyword] = {
            'count': len(snippets),
            'snippets': snippets
        }
    
    return results

# this is the main function
def lambda_handler(event, context):
    # create url for the pdf
    # get year, month, day and gazzetta number from the current date
    today = datetime.date.today()
    # today = datetime.date(2024, 9, 3)
    year = today.year
    month = today.month
    day = today.day

    # gazzetta is the issue number, computed as the number of Tuesdays and Fridays since the beginning of the year
    start_date = datetime.date(year, 1, 1)

    end_date = today
    friday_count, tuesday_count = count_fridays_and_tuesdays(start_date, end_date)
    gazzetta = friday_count + tuesday_count 

    # in my case I can build the URL algorithmically, so the Lambda fetches the PDF for the date on which it runs
    pdf_url = 'https://www.gazzettaufficiale.it/do/gazzetta/downloadPdf?dataPubblicazioneGazzetta=' +f"{year:04}{month:02}{day:02}"+'&numeroGazzetta='+str(gazzetta)+'&tipoSerie=S4&tipoSupplemento=GU&numeroSupplemento=0&progressivo=0&estensione=pdf&edizione=0'

    # the keywords I am looking for
    keywords = ['PHYS-01','ISTITUTO NAZIONALE DI ASTROFISICA','ISTITUTO NAZIONALE DI FISICA NUCLEARE','AGENZIA SPAZIALE ITALIANA','ENEA','AGENZIA NAZIONALE PER LE NUOVE TECNOLOGIE']

    response = ''

    # Download PDF and extract text
    try:
        pdf_bytes = download_pdf(pdf_url)
        pdf_text = extract_text_from_pdf(pdf_bytes)

        # Search keywords
        search_results = search_keywords(pdf_text, keywords)

        # format the response
        response = "Gazzetta URL: " + pdf_url + "\n\n"
        response += format_search_results(search_results)
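        sendEmail('Recap da Gazzetta Ufficiale',response)

        # report success explicitly instead of returning None
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'Email sent.'
            })
        }

    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({
                'message': 'An error occurred.',
                'error': str(e)
            })
        }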
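
Before deploying, it can help to check the issue numbering on your own machine. The following is a minimal sketch that condenses the counting logic above into one function; the name gazzetta_number is mine, and it relies on the same assumption as the post, namely that one issue comes out every Tuesday and Friday except on the fixed holidays.

```python
import datetime

# Fixed-date Italian public holidays as (month, day) pairs, same list as above.
FIXED_HOLIDAYS = {(1, 1), (1, 6), (4, 25), (5, 1), (6, 2),
                  (8, 15), (11, 1), (12, 8), (12, 25), (12, 26)}

def gazzetta_number(day):
    """Count Tuesdays and Fridays from January 1st up to `day`, skipping fixed holidays."""
    count = 0
    current = datetime.date(day.year, 1, 1)
    while current <= day:
        is_holiday = (current.month, current.day) in FIXED_HOLIDAYS
        if not is_holiday and current.weekday() in (1, 4):  # 1 = Tuesday, 4 = Friday
            count += 1
        current += datetime.timedelta(days=1)
    return count

# January 2024: Tuesdays on the 2nd and 9th, Fridays on the 5th and 12th.
print(gazzetta_number(datetime.date(2024, 1, 14)))  # 4
```

If the number you get does not match the one printed on the actual issue, the assumption about publication days does not hold for that period and the URL will be wrong.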

Then you can zip the folder:

zip -r my_lambda.zip .

and upload the archive to the Lambda function (in the Code section of Lambda).
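
If you prefer to stay in Python, or the zip command is not available, the packaging step can be sketched with the standard library. The function name build_package and the paths are hypothetical. The important detail is that lambda_function.py ends up at the root of the archive, which is where the default handler setting lambda_function.lambda_handler looks for it.

```python
import zipfile
from pathlib import Path

def build_package(source_dir, zip_path):
    """Zip every file under source_dir, storing paths relative to source_dir."""
    source = Path(source_dir)
    target = Path(zip_path).resolve()
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories and the archive itself if it lives inside source_dir
            if path.is_file() and path.resolve() != target:
                archive.write(path, path.relative_to(source).as_posix())
    return target

if __name__ == "__main__":
    # Run from inside site-packages, like the zip command above
    print(build_package(".", "my_lambda.zip"))
```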

To create a trigger, you can use a CloudWatch Events rule (the service is now called Amazon EventBridge), which lets you schedule when your Lambda function should run, either at a specific time or at a fixed interval. I want the trigger to fire at 20:00 every Tuesday and Friday, because the website is updated on those days. You can use a cron expression to specify the schedule; for 20:00 on Tuesday and Friday it is 0 20 ? * TUE,FRI *. Rule schedules are evaluated in UTC, so adjust the hour to your time zone.
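
I set up the trigger in the console, but the same rule can be sketched with boto3. This is an untested outline, not what I actually ran: the function names, the rule name and the function ARN are placeholders, and the AWS calls need credentials with the right permissions. The six cron fields are minute, hour, day of month, month, day of week and year.

```python
def schedule_expression(minute, hour, weekdays):
    """Build a cron schedule expression for the given weekdays (e.g. ["TUE", "FRI"])."""
    days = ",".join(weekdays)
    # '?' is required in the day-of-month field when day-of-week is specified
    return f"cron({minute} {hour} ? * {days} *)"

def create_schedule(rule_name, function_arn, expression):
    """Create the rule, allow it to invoke the function, and register the target."""
    import boto3  # imported here so the helper above works without AWS access

    events = boto3.client("events")
    rule = events.put_rule(Name=rule_name, ScheduleExpression=expression, State="ENABLED")
    boto3.client("lambda").add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])

print(schedule_expression(0, 20, ["TUE", "FRI"]))  # cron(0 20 ? * TUE,FRI *)
```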

Finally, you have to give the Lambda permission to use SES, the AWS email service. You can do this in the IAM section of the AWS console: go to IAM -> Policies and create a policy with the following JSON:

{
  "Version": "2012-10-17",
  "Statement": [
      {
          "Effect": "Allow",
          "Action": [
              "ses:SendEmail",
              "ses:SendRawEmail"
          ],
          "Resource": "*"
      }
  ]
}

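Also make sure the function's execution role has the AWSLambdaBasicExecutionRole managed policy attached, if it is not already there.

Then go to your Lambda function, Configuration -> Permissions, and attach the policy you just created to the function's execution role.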
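
For completeness, here is a rough boto3 equivalent of the permission step. Note that it is not identical to the console route described above: it adds the statement as an inline policy on the execution role instead of creating a separate managed policy. The role name and policy name are placeholders, and I have not run this against my own account.

```python
import json

# Same permissions as the JSON policy above
SES_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ses:SendEmail", "ses:SendRawEmail"],
            "Resource": "*",
        }
    ],
}

def attach_ses_policy(role_name, policy_name="lambda-ses-send"):
    """Add the SES statement as an inline policy on the Lambda execution role."""
    import boto3  # imported here so the policy document can be inspected without AWS access

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(SES_POLICY),
    )
```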

This should be it!

Now, if the function works correctly, you will receive an email listing how often each keyword appears in the PDF, with a short snippet of text around each match, so you can check if there is something relevant for you.
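
To get a feeling for what the snippets look like without deploying anything, here is a small stand-alone sketch of the snippet logic. The helper name find_snippets and the sample sentence are made up for illustration; the matching is the same case-insensitive substring search used in the Lambda.

```python
def find_snippets(text, keyword, radius=20):
    """Return bracketed snippets around each case-insensitive match of keyword."""
    lower_text, lower_keyword = text.lower(), keyword.lower()
    snippets, start = [], 0
    while True:
        index = lower_text.find(lower_keyword, start)
        if index == -1:
            return snippets  # no more occurrences
        end = index + len(keyword)
        snippets.append(
            text[max(index - radius, 0):index] + "[" + text[index:end] + "]" + text[end:end + radius]
        )
        start = end  # continue searching after this match

# Made-up sample text, not taken from a real issue
sample = "Concorso pubblico presso l'Istituto Nazionale di Astrofisica, sede di Roma."
for snippet in find_snippets(sample, "ISTITUTO NAZIONALE DI ASTROFISICA"):
    print("-", snippet)
```

Because this is a plain substring search, short keywords such as ENEA can also match inside longer words, which is one reason some snippets turn out to be irrelevant.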

Here is my nice example (the snippets are not very useful in this case, but you get the idea):

As a final note, be careful about the costs of the Lambda function: set sensible limits in the Configuration section of the function. In general it is very inexpensive to run, but be aware that misconfigurations could potentially be very costly.
