How to Monitor IP Addresses, Alert with Twilio, and Track Acknowledgments in Python — The Perfect On-Call Script

Monitoring the uptime of critical IP addresses is an essential task for any system administrator or network engineer. When an IP goes down, you want to be alerted immediately and ensure that the issue is being handled. In this article, we will walk through setting up a Python-based solution that will:

  1. Monitor 10 IP addresses every 15 minutes to check if they are operational.
  2. Send SMS alerts via Twilio when an IP address fails.
  3. Track acknowledgment of the issue from the responsible on-call employee using Twilio.
  4. Log every error event and acknowledgment in an SQLite database, tracking when errors occur, when alerts are sent, and if/when the issue is acknowledged.

Step 1: Setting Up the Database

To start, you’ll need an SQLite database with two tables:

  1. Employees table: Contains the employee on-call information (name, schedule, and phone number to receive text messages).
  2. Error log table: Tracks each IP failure, when the failure occurred, when the alert was sent, and whether the issue was acknowledged.

Create the Employee Table

The employee table will store the employee’s name, schedule, and phone number.

CREATE TABLE employees (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    start_date DATE NOT NULL,
    end_date DATE NOT NULL,
    start_time TIME NOT NULL,
    end_time TIME NOT NULL,
    phone_number TEXT NOT NULL
);

Create the Error Log Table

This table will track each error and its corresponding acknowledgment status. Each row will represent an error event, including timestamps for when alerts are sent and when acknowledgments are received.

CREATE TABLE error_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ip_address TEXT NOT NULL,
    error_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    alert_sent_timestamp DATETIME,
    acknowledgment_timestamp DATETIME,
    acknowledged INTEGER DEFAULT 0,
    employee_id INTEGER,
    FOREIGN KEY (employee_id) REFERENCES employees(id)
);
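
If you would rather create both tables from Python than from the sqlite3 shell, a minimal sketch could look like the following. The init_db helper and the in-memory database are assumptions made for illustration; pass 'monitoring.db' to work on the file used later in this article. Note that SQLite only enforces foreign keys on connections that turn them on with PRAGMA foreign_keys.

```python
import sqlite3

# Same schema as above; IF NOT EXISTS makes the script safe to re-run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    start_date DATE NOT NULL,
    end_date DATE NOT NULL,
    start_time TIME NOT NULL,
    end_time TIME NOT NULL,
    phone_number TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS error_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ip_address TEXT NOT NULL,
    error_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    alert_sent_timestamp DATETIME,
    acknowledgment_timestamp DATETIME,
    acknowledged INTEGER DEFAULT 0,
    employee_id INTEGER,
    FOREIGN KEY (employee_id) REFERENCES employees(id)
);
"""

def init_db(path=":memory:"):
    """Hypothetical helper: open the database and create both tables."""
    conn = sqlite3.connect(path)
    # Foreign keys are off by default and the setting is per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

conn = init_db()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
)]
print(tables)  # ['employees', 'error_logs']
```

Because the pragma applies per connection, the functions in Step 2, which each open their own connection, would need to set it themselves if you want the constraint enforced there.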

Step 2: Python Code to Monitor IPs and Send Alerts

Below is the Python code to monitor the IP addresses and handle the alert and acknowledgment process using Twilio.

Required Libraries:

  • socket: For testing if an IP is reachable.
  • sqlite3: For database operations.
  • twilio: To send SMS messages.
  • Flask: For receiving SMS replies through Twilio webhooks.

First, install the necessary Python packages:

pip install twilio flask

Here’s the Python code to monitor the IPs and manage the error logging and alert system:

import socket
import sqlite3
import time
from twilio.rest import Client
from datetime import datetime

# Twilio account information
account_sid = 'your_account_sid'
auth_token = 'your_auth_token'
client = Client(account_sid, auth_token)

# List of IP addresses to monitor
ip_addresses = ['192.168.1.1', '192.168.1.2', '192.168.1.3']  # ... add the remaining addresses here

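# Function to check IP address status
# socket.gethostbyname() only resolves names and never contacts the host,
# so this version attempts a TCP connection instead.
# The port (80) and the 5-second timeout are assumptions; use a port your hosts actually serve.
def check_ip(ip, port=80, timeout=5):
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False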

# Function to fetch the appropriate employee based on date and time
def get_employee():
    conn = sqlite3.connect('monitoring.db')
    cursor = conn.cursor()
    
    # Read the clock once so the date and time come from the same instant
    now = datetime.now()
    current_date = now.strftime('%Y-%m-%d')
    current_time = now.strftime('%H:%M:%S')
    
    cursor.execute('''
        SELECT id, name, phone_number 
        FROM employees 
        WHERE start_date <= ? AND end_date >= ? 
        AND start_time <= ? AND end_time >= ?
        LIMIT 1;
    ''', (current_date, current_date, current_time, current_time))
    
    employee = cursor.fetchone()
    conn.close()
    
    if employee:
        return {'id': employee[0], 'name': employee[1], 'phone_number': employee[2]}
    return None

# Function to log errors in the database
def log_error(ip, employee_id):
    conn = sqlite3.connect('monitoring.db')
    cursor = conn.cursor()
    
    cursor.execute('''
        INSERT INTO error_logs (ip_address, employee_id) 
        VALUES (?, ?)
    ''', (ip, employee_id))
    
    conn.commit()
    error_id = cursor.lastrowid
    conn.close()
    return error_id

# Function to update when an alert is sent
def update_alert_sent(error_id):
    conn = sqlite3.connect('monitoring.db')
    cursor = conn.cursor()
    
    cursor.execute('''
        UPDATE error_logs
        SET alert_sent_timestamp = CURRENT_TIMESTAMP
        WHERE id = ?
    ''', (error_id,))
    
    conn.commit()
    conn.close()

# Function to send an alert via Twilio
def send_alert(employee, ip, error_id):
    message = client.messages.create(
        body=f"IP Address {ip} is down. Please investigate.",
        from_='+1234567890',  # Twilio number
        to=employee['phone_number']
    )
    update_alert_sent(error_id)  # Update database with alert sent time
    return message.sid

# Function to update acknowledgment status in the database
def update_acknowledgment(error_id):
    conn = sqlite3.connect('monitoring.db')
    cursor = conn.cursor()
    
    cursor.execute('''
        UPDATE error_logs
        SET acknowledged = 1, acknowledgment_timestamp = CURRENT_TIMESTAMP
        WHERE id = ?
    ''', (error_id,))
    
    conn.commit()
    conn.close()

# Function to handle acknowledgment and retries
def wait_for_acknowledgment(employee, ip, error_id):
    for _ in range(3):  # Retry 3 times (15 minutes total)
        # Simulate checking for an acknowledgment (This will later be replaced by Twilio webhooks)
        response = input(f"Has {employee['name']} acknowledged the issue? (Type 'Acknowledged'):\n")
        
        if response.lower() == "acknowledged":
            update_acknowledgment(error_id)
            print("Acknowledgment received, no further alerts.")
            return True
        print("No acknowledgment, retrying...")
        time.sleep(300)  # Wait 5 minutes (300 seconds)
    
    print(f"Resending alert to {employee['name']} for IP: {ip}")
    send_alert(employee, ip, error_id)
    return False

# Main monitoring loop
def monitor_ips():
    while True:
        for ip in ip_addresses:
            if not check_ip(ip):
                employee = get_employee()
                if employee:
                    error_id = log_error(ip, employee['id'])
                    alert_sid = send_alert(employee, ip, error_id)
                    print(f"Alert sent to {employee['name']} for IP {ip}. SID: {alert_sid}")
                    wait_for_acknowledgment(employee, ip, error_id)
                else:
                    print(f"No employee available to handle the issue for IP: {ip}")
        time.sleep(900)  # Wait 15 minutes before next check

# Start monitoring
monitor_ips()

Step 3: Handling Acknowledgment with Twilio Webhooks

To automate acknowledgment, you need Twilio to receive incoming messages from the employees. Here’s how you can do that:

  1. Set up a Flask app to handle incoming SMS replies from Twilio.
  2. Check if the reply contains “Acknowledged” and update the database accordingly.

Flask App for Receiving Acknowledgments

from flask import Flask, request
from twilio.twiml.messaging_response import MessagingResponse
import sqlite3

app = Flask(__name__)

@app.route("/sms", methods=['POST'])
def sms_reply():
    """Handle incoming SMS"""
    # Default to an empty string so a request without a Body does not raise
    incoming_msg = request.form.get('Body', '').strip().lower()
    from_number = request.form.get('From')
    
    # Check if the message is an acknowledgment
    if incoming_msg == 'acknowledged':
        conn = sqlite3.connect('monitoring.db')
        cursor = conn.cursor()
        
        # Find the latest unacknowledged error associated with this phone number.
        # IN (rather than =) also covers an employee who has several schedule rows.
        cursor.execute('''
            SELECT id 
            FROM error_logs 
            WHERE employee_id IN (SELECT id FROM employees WHERE phone_number = ?) 
            AND acknowledged = 0 
            ORDER BY error_timestamp DESC 
            LIMIT 1
        ''', (from_number,))
        
        error = cursor.fetchone()
        # Close the read connection before update_acknowledgment() opens its own
        conn.close()
        
        if error:
            update_acknowledgment(error[0])  # Update acknowledgment in the error log
        
        # Respond with confirmation
        response = MessagingResponse()
        response.message("Thank you for acknowledging the alert.")
        return str(response)
    
    return "Invalid response", 400

def update_acknowledgment(error_id):
    conn = sqlite3.connect('monitoring.db')
    cursor = conn.cursor()
    
    cursor.execute('''
        UPDATE error_logs
        SET acknowledged = 1, acknowledgment_timestamp = CURRENT_TIMESTAMP
        WHERE id = ?
    ''', (error_id,))
    
    conn.commit()
    conn.close()

if __name__ == "__main__":
    app.run(debug=True)  # Debug mode is for local testing only; turn it off on a public server
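
Once the webhook writes acknowledgments to the database, the input() prompt in Step 2's wait_for_acknowledgment can be replaced by polling the acknowledged column. The following is only a sketch under assumptions: the function signatures, the polling interval, the injectable sleep parameter, and the cut-down error_logs table in the demo are illustrative choices, not part of the script above.

```python
import os
import sqlite3
import tempfile
import time

def is_acknowledged(db_path, error_id):
    """Return True if the error row has acknowledged = 1."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT acknowledged FROM error_logs WHERE id = ?", (error_id,)
        ).fetchone()
    finally:
        conn.close()
    return bool(row and row[0])

def wait_for_acknowledgment(db_path, error_id, timeout=900, poll_every=15,
                            sleep=time.sleep):
    """Poll until the row is acknowledged or `timeout` seconds have passed."""
    waited = 0
    while waited < timeout:
        if is_acknowledged(db_path, error_id):
            return True
        sleep(poll_every)
        waited += poll_every
    return is_acknowledged(db_path, error_id)

# Demo on a throwaway database file with a cut-down error_logs table.
db_path = os.path.join(tempfile.mkdtemp(), "monitoring.db")
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE error_logs ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "ip_address TEXT NOT NULL, "
    "acknowledged INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO error_logs (ip_address) VALUES ('10.10.10.10')")
conn.commit()
conn.close()

def fake_sleep(seconds):
    # Stands in for the webhook: marks error 1 as acknowledged instead of sleeping.
    c = sqlite3.connect(db_path)
    c.execute("UPDATE error_logs SET acknowledged = 1 WHERE id = 1")
    c.commit()
    c.close()

print(wait_for_acknowledgment(db_path, 1, sleep=fake_sleep))  # True
```

When the function returns False, the caller can resend the alert, which matches the resend step at the end of the original wait_for_acknowledgment.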

Step 4: Configuring Twilio Webhooks

To receive the SMS replies, you need to configure Twilio’s webhook to point to the Flask app’s /sms endpoint:

  1. Go to your Twilio console.
  2. Under the Messaging section, select your Twilio phone number.
  3. In the “A Message Comes In” field, add the URL of your Flask app (e.g., http://yourserver.com/sms). Twilio calls this URL from the internet, so it must be publicly reachable.

Step 5: Putting It All Together

The Python script now:

  1. Checks the status of each IP address every 15 minutes.
  2. Logs any errors in the error_logs table.
  3. Sends an SMS alert to the appropriate employee.
  4. Waits for acknowledgment, either manually or through the Twilio webhook.
  5. Retries sending the alert if no acknowledgment is received.

The database stores all relevant information about IP failures, when alerts are sent, and whether the issue has been acknowledged. This ensures a full audit trail of issues and their resolutions.


Scenario Walk-Through

Let’s walk through what happens when the IP address 10.10.10.10 fails in this system:

Step 1: Monitoring IPs

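The Python script is running, and every 15 minutes it checks the status of a list of IP addresses, including 10.10.10.10. The script passes this IP to check_ip() to decide whether it’s reachable. A name lookup such as socket.gethostbyname() is not enough for this, because it never contacts the host; the check has to attempt an actual connection.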

Step 2: Detecting Failure

If 10.10.10.10 is unreachable, the script detects this as a failure. At this point, it initiates the following actions:

  1. Find the On-Call Employee:
  • The script queries the employees table in the SQLite database to find which employee is responsible for handling failures at the current date and time.
  • For example, if the failure occurs at 3 PM on a weekday, the script will check which employee is scheduled during that time window.
  2. Log the Error:
  • The failure is logged into the error_logs table with the IP address (10.10.10.10), the timestamp of the failure, and the ID of the employee who will be responsible for handling it. The alert_sent_timestamp and acknowledgment_timestamp fields are initially empty and acknowledged is 0, because no alert has been sent or acknowledged yet.
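
To make the on-call lookup concrete, here is a small sketch that runs the same query as get_employee against an in-memory database. The find_on_call helper, the sample employee, the dates, and the phone number are all made up for illustration; the helper takes the moment as a parameter so the result does not depend on when you run it.

```python
import sqlite3
from datetime import datetime

def find_on_call(conn, moment):
    """Return (id, name, phone_number) for whoever is on call at `moment`, or None."""
    date_str = moment.strftime("%Y-%m-%d")
    time_str = moment.strftime("%H:%M:%S")
    return conn.execute(
        """
        SELECT id, name, phone_number
        FROM employees
        WHERE start_date <= ? AND end_date >= ?
          AND start_time <= ? AND end_time >= ?
        LIMIT 1
        """,
        (date_str, date_str, time_str, time_str),
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        start_date DATE NOT NULL,
        end_date DATE NOT NULL,
        start_time TIME NOT NULL,
        end_time TIME NOT NULL,
        phone_number TEXT NOT NULL
    )
""")
# Placeholder employee: on call 09:00 to 17:00 throughout 2024.
conn.execute(
    "INSERT INTO employees "
    "(name, start_date, end_date, start_time, end_time, phone_number) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("Alice", "2024-01-01", "2024-12-31", "09:00:00", "17:00:00", "+15555550100"),
)

print(find_on_call(conn, datetime(2024, 6, 3, 15, 0)))  # (1, 'Alice', '+15555550100')
print(find_on_call(conn, datetime(2024, 6, 3, 20, 0)))  # None
```

The comparison works on plain strings, so dates and times must be stored zero-padded as YYYY-MM-DD and HH:MM:SS. A shift that crosses midnight (say 22:00:00 to 06:00:00) never matches this query, because no time is both after the start and before the end; such a shift would have to be stored as two rows or handled with a different condition.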

Step 3: Sending the Alert via Twilio

Once the failure is logged, the script sends an SMS alert using Twilio to the on-call employee’s phone number. The message contains information about the failure, for example:

“IP Address 10.10.10.10 is down. Please investigate.”

At this point, the alert_sent_timestamp field in the error_logs table is updated with the time the alert was sent.

Step 4: Waiting for Acknowledgment

After the alert is sent, the system waits for the employee to acknowledge the message. The employee has two ways to respond:

  1. Acknowledge the Issue:
  • The employee responds to the SMS with the word “Acknowledged”.
  • The reply is handled by the Twilio webhook and the Flask app. When the webhook receives the acknowledgment message, it checks the error_logs table to find the latest unacknowledged error for that employee.
  • The system updates the acknowledgment_timestamp and marks the acknowledged field as 1 (true) in the database. This tells the system that the issue is being handled.
  • The system stops sending further alerts for this issue because it has been acknowledged.
  2. No Acknowledgment:
  • The script checks for an acknowledgment three times, five minutes apart. If the employee hasn’t replied after 15 minutes, the acknowledged field is still 0 (false).
  • The system then resends the alert SMS to the employee, repeating the message that “IP Address 10.10.10.10 is down.”
  • On the next 15-minute monitoring pass, an IP that is still down is logged as a new error and a fresh alert is sent, so alerts continue for as long as the problem persists.

Step 5: Resending Alerts (if necessary)

If the employee does not acknowledge the failure after the three checks (15 minutes), the script resends the alert once and updates alert_sent_timestamp in the error_logs table to the time of the resend. The acknowledged field stays 0 until a reply arrives. Any further retries depend on the retry logic configured in the script.

Step 6: Issue Resolution and Logging

Once the issue is acknowledged, the following happens:

  1. The acknowledgment is logged in the error_logs table with a timestamp.
  2. The system stops sending further alerts for this issue.
  3. The database now has a complete log of the failure, including the exact times of the failure, when the alert was sent, when (or if) it was acknowledged, and who was responsible for handling it.

Stepped Scenario

Let’s say 10.10.10.10 fails at 2:30 PM on a weekday:

  1. The system checks the IP and detects the failure.
  2. It looks up the employee scheduled to handle issues at 2:30 PM. Let’s say Alice is on duty from 9 AM to 5 PM.
  3. The system logs the error and sends an SMS alert to Alice’s phone number: “IP Address 10.10.10.10 is down. Please investigate.”
  4. Alice sees the SMS and replies with “Acknowledged.” This acknowledgment is received by the Flask app.
  5. The system logs the acknowledgment and stops sending alerts for this issue.

If Alice didn’t respond:

  • The script would check for her reply three times over 15 minutes and then send the alert again. If 10.10.10.10 were still down at the next monitoring pass, a new error would be logged and a new alert sent.

The database records all of this, including:

  • The time of the failure.
  • The time the alert was sent.
  • Whether and when the issue was acknowledged.
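
The life cycle of one error_logs row in this scenario can be replayed with a short sketch. It uses an in-memory database and the same SQL statements as log_error, update_alert_sent, and update_acknowledgment; the snapshot helper and the employee ID of 1 are assumptions for illustration, and the foreign key clause is left out because the demo has no employees table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE error_logs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        ip_address TEXT NOT NULL,
        error_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
        alert_sent_timestamp DATETIME,
        acknowledgment_timestamp DATETIME,
        acknowledged INTEGER DEFAULT 0,
        employee_id INTEGER
    )
""")

def snapshot(conn, error_id):
    """Hypothetical helper: summarize which fields of the row are filled in."""
    row = conn.execute(
        "SELECT alert_sent_timestamp IS NOT NULL, "
        "acknowledgment_timestamp IS NOT NULL, acknowledged "
        "FROM error_logs WHERE id = ?",
        (error_id,),
    ).fetchone()
    return {"alert_sent": bool(row[0]), "ack_time_set": bool(row[1]),
            "acknowledged": row[2]}

# 1. The failure is logged (same INSERT as log_error).
cur = conn.execute(
    "INSERT INTO error_logs (ip_address, employee_id) VALUES (?, ?)",
    ("10.10.10.10", 1),
)
error_id = cur.lastrowid
after_log = snapshot(conn, error_id)

# 2. The alert goes out (same UPDATE as update_alert_sent).
conn.execute(
    "UPDATE error_logs SET alert_sent_timestamp = CURRENT_TIMESTAMP WHERE id = ?",
    (error_id,),
)
after_alert = snapshot(conn, error_id)

# 3. Alice replies (same UPDATE as update_acknowledgment).
conn.execute(
    "UPDATE error_logs SET acknowledged = 1, "
    "acknowledgment_timestamp = CURRENT_TIMESTAMP WHERE id = ?",
    (error_id,),
)
after_ack = snapshot(conn, error_id)

print(after_log)    # {'alert_sent': False, 'ack_time_set': False, 'acknowledged': 0}
print(after_alert)  # {'alert_sent': True, 'ack_time_set': False, 'acknowledged': 0}
print(after_ack)    # {'alert_sent': True, 'ack_time_set': True, 'acknowledged': 1}
```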

Thank you for following along with this tutorial. We hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials/articles, please feel free to join and comment. Your feedback and suggestions are always welcome!

You can find the same tutorial on Medium.com.
