Using PyMongo and Google Cloud DLP to Securely Process MongoDB Data
So how do you combine the power of PyMongo, a popular Python library for working with MongoDB, and Google Cloud’s Data Loss Prevention (DLP) API, which allows you to discover, classify, and redact sensitive information in your data?
Here’s a quick example that is NOT suitable for production use.
Below I will create a Python script that connects to a MongoDB instance, retrieves a random sample of documents, and then processes each document through the Google Cloud DLP API to identify sensitive information.
Prerequisites:
- Familiarity with Python and MongoDB
- A MongoDB instance with some data
- A Google Cloud Platform (GCP) project with DLP API enabled
- A GCP service account with a JSON key file for authentication
Step 1: Install necessary libraries
First, make sure you have the required libraries installed:
pip install pymongo google-cloud-dlp
Step 2: Set up Google Cloud credentials
Create a service account for your GCP project and download the JSON key file. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS
to the path of the key file. More information on this process can be found in the official documentation.
export GOOGLE_APPLICATION_CREDENTIALS="<path_to_your_key_file>"
Don’t use service account keys in production, or anywhere you can avoid them; prefer short-lived credentials such as Application Default Credentials or Workload Identity.
Step 3: Connect to the MongoDB instance
Use PyMongo to connect to your MongoDB instance. Make sure to replace the connection string with your own credentials:
from pymongo import MongoClient
mongo_connection_string = "mongodb+srv://<username>:<password>@cluster0.mongodb.net/test?retryWrites=true&w=majority"
client = MongoClient(mongo_connection_string)
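One gotcha worth noting here: if your username or password contains reserved characters such as @ or :, they must be percent-encoded before going into the connection string, or PyMongo will fail to parse the URI. A minimal helper using only the standard library (the host and database names below are placeholders):

```python
from urllib.parse import quote_plus

def build_srv_uri(username, password, host, db="test"):
    # Percent-encode credentials so characters like '@' or ':'
    # don't break the mongodb+srv URI
    return (
        f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}/{db}?retryWrites=true&w=majority"
    )

print(build_srv_uri("appuser", "p@ss:word", "cluster0.mongodb.net"))
# mongodb+srv://appuser:p%40ss%3Aword@cluster0.mongodb.net/test?retryWrites=true&w=majority
```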
Step 4: Set up DLP
Define a function to process data through the Data Loss Prevention API.
from google.cloud import dlp_v2

def process_data_with_dlp(data):
    dlp_client = dlp_v2.DlpServiceClient()
    # Replace this with the project ID of your Google Cloud project
    project_id = "your-project-id"
    # Configure the DLP settings: which built-in info types to look for
    info_types = [{"name": info_type} for info_type in ["PHONE_NUMBER", "EMAIL_ADDRESS"]]
    inspect_config = {
        "info_types": info_types,
        "include_quote": True,
    }
    # Define the content to process
    item = {"value": data}
    # Process the data through the DLP API
    response = dlp_client.inspect_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": inspect_config,
            "item": item,
        }
    )
    # Print the results
    for finding in response.result.findings:
        print(
            f"Found {finding.info_type.name} at position "
            f"{finding.location.codepoint_range.start}:{finding.location.codepoint_range.end}"
        )
Step 5: Connect to MongoDB
Query a random sample of documents from MongoDB.
# Replace the connection string with your own credentials
mongo_connection_string = "mongodb+srv://<username>:<password>@cluster0.mongodb.net/test?retryWrites=true&w=majority"
client = MongoClient(mongo_connection_string)
# Connect to the database and collection
db = client["your_database_name"]
collection = db["your_collection_name"]
# Retrieve a random sample of documents from the collection
sample_size = 10 # Change this value to the number of random documents you want
random_sample = collection.aggregate([{"$sample": {"size": sample_size}}])
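Before sending documents to DLP, it is worth thinking about how to turn a BSON document into inspectable text. Converting with str(doc), as the script below does, works but includes Python repr noise such as ObjectId(...). One alternative, shown here as a standalone sketch rather than part of the script, is to pull out just the string fields along with their field paths:

```python
def extract_text_fields(doc, prefix=""):
    # Recursively collect string values from a (possibly nested) document,
    # keyed by dotted field path, so each can be inspected individually.
    fields = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, str):
            fields[path] = value
        elif isinstance(value, dict):
            fields.update(extract_text_fields(value, path))
    return fields

doc = {"name": "Ada", "contact": {"email": "ada@example.com", "age": 36}}
print(extract_text_fields(doc))
# {'name': 'Ada', 'contact.email': 'ada@example.com'}
```

Knowing the field path of each finding also makes later redaction or per-field updates back into MongoDB much easier.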
Step 6: Put it all together
The whole script with plenty of room for improvement.
from pymongo import MongoClient
from google.cloud import dlp_v2

def process_data_with_dlp(data):
    dlp_client = dlp_v2.DlpServiceClient()
    # Replace this with the project ID of your Google Cloud project
    project_id = "your-project-id"
    # Configure the DLP settings: which built-in info types to look for
    info_types = [{"name": info_type} for info_type in ["PHONE_NUMBER", "EMAIL_ADDRESS"]]
    inspect_config = {
        "info_types": info_types,
        "include_quote": True,
    }
    # Define the content to process
    item = {"value": data}
    # Process the data through the DLP API
    response = dlp_client.inspect_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": inspect_config,
            "item": item,
        }
    )
    # Print the results, f-strings are cool
    for finding in response.result.findings:
        print(
            f"Found {finding.info_type.name} at position "
            f"{finding.location.codepoint_range.start}:{finding.location.codepoint_range.end}"
        )

# Replace the connection string with your own credentials
mongo_connection_string = "mongodb+srv://<username>:<password>@cluster0.mongodb.net/test?retryWrites=true&w=majority"
client = MongoClient(mongo_connection_string)

# Connect to the database and collection
db = client["your_database_name"]
collection = db["your_collection_name"]

# Retrieve a random sample of documents from the collection
sample_size = 10  # Change this value to the number of random documents you want
random_sample = collection.aggregate([{"$sample": {"size": sample_size}}])

# Process each document in the random sample through the DLP API
for doc in random_sample:
    # Convert the document to a string (DLP inspects plain text)
    doc_data = str(doc)
    print(f"Processing document: {doc['_id']}")
    process_data_with_dlp(doc_data)
    print()
Improvements
The script is a basic example demonstrating the integration of PyMongo with Google Cloud DLP API. There is room for improvement and customization depending on your specific use case and requirements. Here are some recommendations:
- Error handling: Add proper error handling so the script can cope with exceptions or unexpected behavior, such as MongoDB connectivity issues, invalid document structures, or errors returned by the Google Cloud DLP API.
- Redaction: Instead of just identifying sensitive information, extend the script to redact or mask the sensitive data using Google Cloud DLP’s de-identification capabilities. This could be useful for storing sanitized data or generating reports that don’t expose sensitive information.
- Custom info types: The current script only looks for phone numbers and email addresses. You can extend the list of built-in info types or create custom info types tailored to your specific needs, such as dictionaries or regular expressions for patterns relevant to your data.
- Configurable settings: Make the script more flexible by letting users pass in configuration settings, such as MongoDB connection details, the Google Cloud project ID, DLP settings, and the number of random documents to process, either through command-line arguments or a configuration file.
- Logging: Incorporate proper logging to record the script’s activities, findings, and any errors encountered during execution. This will help with monitoring, debugging, and auditing the process.
- Performance improvements: If you need to process a large number of documents, consider implementing parallelism or batch processing. For example, you could use Python’s concurrent.futures module to process multiple documents concurrently, or batch multiple values into a single DLP request (for instance as a table-style ContentItem) to reduce round trips to the API.
- Integration with other tools: Depending on your use case, you might want to integrate the script with other tools, such as sending alerts or notifications when sensitive information is detected, or feeding a data pipeline for further processing and analysis.
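As a starting point for the redaction idea above: DLP’s deidentify_content method takes a deidentify_config alongside the inspect_config. The sketch below only builds the request dictionary (the project ID and info types are placeholders); wiring it up to DlpServiceClient().deidentify_content(request=...) follows the same pattern as inspect_content in the script above.

```python
def build_deidentify_request(project_id, text, info_types=("PHONE_NUMBER", "EMAIL_ADDRESS")):
    # Replace each finding with its info type name, e.g. "[EMAIL_ADDRESS]"
    return {
        "parent": f"projects/{project_id}",
        "inspect_config": {"info_types": [{"name": t} for t in info_types]},
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {"primitive_transformation": {"replace_with_info_type_config": {}}}
                ]
            }
        },
        "item": {"value": text},
    }
```

Passing this request to deidentify_content returns a response whose item.value has the sensitive substrings replaced with their info type names, which you could then write back to a sanitized collection.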
Remember that as you make improvements, you should also ensure that the script remains maintainable, modular, and easy to understand. This will facilitate future updates and customizations.