Handling sensitive information is a critical aspect of software development, especially when dealing with user data. Whether it’s masking Social Security Numbers (SSNs), email addresses, or URLs, ensuring that this data is appropriately scrubbed is essential for maintaining privacy and security. In this post, I’ll share my experience creating a Python function that efficiently scrubs sensitive information using regular expressions (regex). We’ll explore how to design the function for flexibility and performance, ensuring it meets various data scrubbing needs.
Understanding the Importance of Data Scrubbing
In many applications, displaying raw sensitive information can lead to privacy breaches and security vulnerabilities. For instance, exposing SSNs in logs or user interfaces can be exploited maliciously. Similarly, revealing email addresses or URLs might compromise user privacy or lead to spam. By implementing a scrubbing mechanism, we can replace or mask these sensitive details, ensuring that only non-sensitive information is visible to users or stored in logs.
Designing the Scrub Regex Function
The core of our solution is a Python function named scrub_regex. This function takes a list of pattern dictionaries and an input string, then replaces sensitive information in the string according to those patterns. Here’s how I approached building this function:
import re
from typing import List, Dict, Pattern, Any

# Default patterns for SSN, email, and URL scrubbing
default_patterns = [
    {
        "regex": re.compile(r"\d{3}-\d{2}-\d{4}"),  # SSN pattern
        "minLength": 11,
        "replacement": "***-**-****",
        "shortCircuit": True,  # Stop after replacing SSN
    },
    {
        "regex": re.compile(
            r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE
        ),  # Email pattern
        "minLength": 5,
        "replacement": "[EMAIL]",
        "shortCircuit": False,  # Continue processing even after replacing email
    },
    {
        "regex": re.compile(r"https?://\S+"),  # URL pattern
        "minLength": 10,
        "replacement": "[URL]",
        "shortCircuit": False,  # Continue processing even after replacing URL
    },
]
def scrub_regex(patterns: List[Dict[str, Any]], input_str: str) -> str:
    """
    Scrubs the input string based on the provided precompiled patterns with optional short-circuiting.

    Each pattern in the patterns list should be a dictionary with the following keys:
    - 'regex': A compiled regular expression pattern (re.Pattern object).
    - 'minLength': An integer specifying the minimum length required to apply this regex.
    - 'replacement': A string to replace matches of the regex.
    - 'shortCircuit' (optional): A boolean indicating whether to stop processing further
      patterns after a replacement. Defaults to True.

    The function processes each pattern in order. For each pattern:
    1. It checks if the length of the current string is at least `minLength`.
    2. If so, it applies the regex substitution.
    3. If a substitution occurs:
       - If `shortCircuit` is True or not specified, the function returns the modified string immediately.
       - If `shortCircuit` is False, it continues processing remaining patterns.

    Args:
        patterns (List[Dict[str, Any]]): A list of precompiled pattern dictionaries.
        input_str (str): The string to be scrubbed.

    Returns:
        str: The scrubbed string with replacements applied based on pattern configurations.
    """
    result_str = input_str  # Initialize with the original string
    for pattern in patterns:
        compiled_regex: Pattern = pattern.get("regex")
        min_length: int = pattern.get("minLength", 0)
        replacement: str = pattern.get("replacement", "")
        short_circuit: bool = pattern.get(
            "shortCircuit", True
        )  # Default to True if not specified
        if not compiled_regex:
            continue  # Skip patterns without a compiled regex
        if len(result_str) >= min_length:
            new_str, num_subs = compiled_regex.subn(replacement, result_str)
            if num_subs > 0:
                result_str = new_str  # Update the string with replacements
                if short_circuit:
                    break  # Exit early if short-circuiting is enabled for this pattern
    return result_str  # Return the modified string
This function iterates through each pattern, checks if the input string meets the minimum length requirement, and then applies the regex substitution. If a replacement occurs and shortCircuit is enabled, the function halts further processing, ensuring efficiency by not processing unnecessary patterns.
Optimizing Performance with Precompiled Regex Patterns
Regular expressions can be computationally intensive, especially when processed repeatedly in high-frequency scenarios. To optimize performance, I precompiled the regex patterns. Precompiling ensures that each regex is compiled only once, reducing overhead during multiple function calls.
By organizing the patterns this way, the function can efficiently reuse the compiled regex objects, enhancing overall performance, especially in applications where scrub_regex is invoked frequently.
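One way to check how much precompiling helps on your own workload is to time both forms. The following is a minimal micro-benchmark sketch, not a measurement from this post: the sample text, the function names, and the iteration count are made up for illustration. Keep in mind that Python's re module keeps an internal cache of recently compiled patterns, so the difference is often modest; precompiling mainly avoids the cache lookup on each call.

```python
import re
import timeit

SSN_SOURCE = r"\d{3}-\d{2}-\d{4}"
SSN_COMPILED = re.compile(SSN_SOURCE)  # compiled once, at import time
SAMPLE = "My SSN is 123-45-6789."      # hypothetical sample input


def scrub_precompiled(text: str) -> str:
    # Reuses the already-compiled pattern object.
    return SSN_COMPILED.sub("***-**-****", text)


def scrub_from_source(text: str) -> str:
    # Passes the pattern string; re consults its internal cache
    # (compiling on a cache miss) on every call.
    return re.sub(SSN_SOURCE, "***-**-****", text)


if __name__ == "__main__":
    runs = 10_000  # arbitrary iteration count for illustration
    t_pre = timeit.timeit(lambda: scrub_precompiled(SAMPLE), number=runs)
    t_src = timeit.timeit(lambda: scrub_from_source(SAMPLE), number=runs)
    print(f"precompiled: {t_pre:.4f}s, from source: {t_src:.4f}s")
```

Both functions return the same scrubbed string; only the per-call overhead differs, and the numbers will vary by machine and Python version.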
Adding Flexible Short-Circuit Control
Flexibility is vital when dealing with diverse data scrubbing needs. Sometimes you want the function to stop after a specific replacement; other times you want it to run every pattern. To accommodate this, each pattern includes a shortCircuit flag.
For example, when an SSN is detected and replaced, setting shortCircuit to True ensures that the function doesn’t go on to apply the email or URL patterns to the same string, so any email or URL in that string is left as-is. Conversely, setting shortCircuit to False for emails and URLs lets the remaining patterns run after a replacement, so several kinds of data can be scrubbed from one input string.
This design choice provides granular control over the scrubbing process, allowing you to prioritize certain types of data over others based on your application’s requirements.
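To make the trade-off concrete, the sketch below runs one string containing both an SSN and an email through a condensed version of the loop. Note that scrub, Rule, and the two rule lists are hypothetical, simplified stand-ins for scrub_regex and default_patterns (tuples instead of dictionaries, no minLength check), written only for this illustration.

```python
import re
from typing import List, Pattern, Tuple

# (compiled regex, replacement, short_circuit): simplified stand-in for the
# pattern dictionaries used in the post.
Rule = Tuple[Pattern[str], str, bool]

SSN = re.compile(r"\d{3}-\d{2}-\d{4}")
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)


def scrub(rules: List[Rule], text: str) -> str:
    for regex, replacement, short_circuit in rules:
        text, count = regex.subn(replacement, text)
        if count and short_circuit:
            break  # later rules never see the string
    return text


text = "SSN 123-45-6789, email example@example.com"

stop_after_ssn = [(SSN, "***-**-****", True), (EMAIL, "[EMAIL]", False)]
keep_going = [(SSN, "***-**-****", False), (EMAIL, "[EMAIL]", False)]

print(scrub(stop_after_ssn, text))  # SSN ***-**-****, email example@example.com
print(scrub(keep_going, text))      # SSN ***-**-****, email [EMAIL]
```

With short-circuiting on the SSN rule, the email survives in the output, so the flag is best reserved for cases where the rest of the string is known to be safe to show.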
Practical Usage of the Scrub Regex Function
To see the scrub_regex function in action, let’s consider some example use cases:
if __name__ == "__main__":
    test_strings = [
        "My SSN is 123-45-6789.",
        "Contact me at example@example.com.",
        "Visit https://www.example.com for more info.",
        "Short 123.",
        "No sensitive info here.",
    ]
    for s in test_strings:
        scrubbed = scrub_regex(default_patterns, s)
        print(f"Original: {s}\nScrubbed: {scrubbed}\n")
Expected Output
Original: My SSN is 123-45-6789.
Scrubbed: My SSN is ***-**-****.
Original: Contact me at example@example.com.
Scrubbed: Contact me at [EMAIL].
Original: Visit https://www.example.com for more info.
Scrubbed: Visit [URL] for more info.
Original: Short 123.
Scrubbed: Short 123.
Original: No sensitive info here.
Scrubbed: No sensitive info here.
In the first case, the SSN is detected and replaced with ***-**-****. Since shortCircuit is set to True for the SSN pattern, the function stops there and never applies the email or URL patterns. In the second example, the email is replaced with [EMAIL], and because shortCircuit is False, the function continues to check for URLs, but none are present in this string.
Handling Multiple Replacements and Overlapping Patterns
Consider a scenario where an input string contains both an email and a URL:
input_str = "Contact me at example@example.com or visit https://www.example.com."
scrubbed = scrub_regex(default_patterns, input_str)
print(scrubbed)
Output:
Contact me at [EMAIL] or visit [URL]
Here, the email is replaced first, and since shortCircuit for the email pattern is False, the function goes on to replace the URL as well. Note that the sentence’s final period is gone: \S+ in the URL pattern matches every non-whitespace character, so the trailing period is treated as part of the URL and replaced along with it.
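The URL pattern https?://\S+ matches any run of non-whitespace characters, so sentence punctuation directly after a link becomes part of the match. One possible way to keep that punctuation is to pass a function instead of a string as the replacement, which the regex sub and subn methods accept. The sketch below assumes a hypothetical helper name (mask_url) and an arbitrary choice of which characters count as trailing punctuation.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Assumption: these characters are sentence punctuation when they end a match.
TRAILING_PUNCTUATION = ".,;:!?)"


def mask_url(match: re.Match) -> str:
    url = match.group(0)
    core = url.rstrip(TRAILING_PUNCTUATION)
    # Re-append whatever punctuation was stripped from the end of the match.
    return "[URL]" + url[len(core):]


print(URL_RE.sub(mask_url, "Visit https://www.example.com."))  # Visit [URL].
```

Since scrub_regex hands the replacement value straight to subn, a callable like this could in principle be used as a pattern’s "replacement", although the type hints in the post describe it as a string.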
Dealing with Edge Cases
It’s essential to ensure that the scrub_regex function handles various edge cases gracefully. For instance, what happens if the input string is empty or doesn’t match any of the patterns?
input_str = ""
scrubbed = scrub_regex(default_patterns, input_str)
print(scrubbed) # Output: (empty string)
input_str = "No sensitive info here."
scrubbed = scrub_regex(default_patterns, input_str)
print(scrubbed) # Output: "No sensitive info here."
In both cases, the function returns the original string unchanged, as expected. This behavior ensures that the function doesn’t introduce unintended modifications when there’s nothing to scrub.
Extending the Function for Additional Patterns
While the default patterns cover SSNs, emails, and URLs, the scrub_regex function is designed to be flexible. You can easily add more patterns to handle other types of sensitive information. For example, to scrub credit card numbers:
credit_card_pattern = {
    "regex": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
    "minLength": 19,
    "replacement": "****-****-****-****",
    "shortCircuit": False,
}
# Add to the patterns list
patterns_extended = default_patterns + [credit_card_pattern]
input_str = "My credit card number is 1234-5678-9012-3456."
scrubbed = scrub_regex(patterns_extended, input_str)
print(scrubbed) # Output: "My credit card number is ****-****-****-****."
This example showcases how easily the function can be extended to handle additional patterns, making it a versatile tool for various data scrubbing needs.
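The pattern above only matches card numbers written with hyphens. As a sketch of how the same dictionary shape could cover other spellings, the variant below also accepts spaces or no separator at all. It assumes 16-digit numbers in four groups (some issuers use other lengths), and the names CARD_RE and flexible_card_pattern are illustrative, not part of the original function.

```python
import re

# Assumption: 16 digits in four groups, separated by a hyphen, a space,
# or nothing. Card numbers of other lengths are not covered.
CARD_RE = re.compile(r"\b\d{4}(?:[- ]?\d{4}){3}\b")

flexible_card_pattern = {
    "regex": CARD_RE,
    "minLength": 16,  # shortest form: 16 digits with no separators
    "replacement": "[CARD]",
    "shortCircuit": False,
}

samples = [
    "Card: 1234-5678-9012-3456.",
    "Card: 1234 5678 9012 3456.",
    "Card: 1234567890123456.",
]
for sample in samples:
    # Each line prints "Card: [CARD]."
    print(CARD_RE.sub(flexible_card_pattern["replacement"], sample))
```

The \b anchors keep the pattern from firing inside a longer run of digits, such as a 17-digit order number.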
Ensuring Robustness with Error Handling
While the scrub_regex function handles the common cases, it’s good practice to include error handling to manage unexpected scenarios. For instance, what if a pattern lacks the regex key or if the input isn’t a string?
In the current implementation, patterns missing the regex key are skipped, ensuring that the function doesn’t break. However, if the input isn’t a string, the function will typically raise a TypeError, either from len() or from the regex substitution. To enhance robustness, you might consider adding type checks or try-except blocks to handle such cases gracefully.
def scrub_regex(patterns: List[Dict[str, Any]], input_str: str) -> str:
    if not isinstance(input_str, str):
        raise TypeError("Input must be a string.")
    result_str = input_str  # Initialize with the original string
    for pattern in patterns:
        compiled_regex: Pattern = pattern.get("regex")
        min_length: int = pattern.get("minLength", 0)
        replacement: str = pattern.get("replacement", "")
        short_circuit: bool = pattern.get(
            "shortCircuit", True
        )  # Default to True if not specified
        if not compiled_regex:
            continue  # Skip patterns without a compiled regex
        if len(result_str) >= min_length:
            try:
                new_str, num_subs = compiled_regex.subn(replacement, result_str)
            except re.error as e:
                # The regex is already compiled, so an error here comes from the
                # replacement template (for example, a bad group reference).
                print(f"Invalid replacement template: {e}")
                continue  # Skip patterns whose replacement cannot be applied
            if num_subs > 0:
                result_str = new_str  # Update the string with replacements
                if short_circuit:
                    break  # Exit early if short-circuiting is enabled for this pattern
    return result_str  # Return the modified string
By incorporating such checks, you ensure that the function remains reliable even when faced with unexpected inputs or pattern configurations.
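Another option is to check the pattern dictionaries once, when they are defined, instead of discovering problems during scrubbing. The helper below is a hypothetical sketch (validate_patterns is not part of the original code) that assumes the same dictionary keys used throughout this post.

```python
import re
from typing import Any, Dict, List


def validate_patterns(patterns: List[Dict[str, Any]]) -> None:
    """Raise early if any pattern dictionary is malformed (hypothetical helper)."""
    for index, pattern in enumerate(patterns):
        regex = pattern.get("regex")
        if not isinstance(regex, re.Pattern):
            raise TypeError(f"patterns[{index}]: 'regex' must be a compiled pattern")

        min_length = pattern.get("minLength", 0)
        if not isinstance(min_length, int) or min_length < 0:
            raise ValueError(f"patterns[{index}]: 'minLength' must be a non-negative int")

        replacement = pattern.get("replacement", "")
        if not (isinstance(replacement, str) or callable(replacement)):
            raise TypeError(f"patterns[{index}]: 'replacement' must be a str or callable")

        if not isinstance(pattern.get("shortCircuit", True), bool):
            raise TypeError(f"patterns[{index}]: 'shortCircuit' must be a bool")


# Example: a well-formed pattern passes silently.
validate_patterns(
    [{"regex": re.compile(r"\d{3}-\d{2}-\d{4}"), "minLength": 11, "replacement": "***-**-****"}]
)
```

Calling a check like this once at startup means a misconfigured pattern fails loudly instead of being skipped silently on every call.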
Reflecting on the Development Process
Building the scrub_regex function was an insightful journey into balancing flexibility, performance, and reliability. By leveraging precompiled regex patterns, I optimized the function for scenarios where it’s invoked frequently, reducing computational overhead. Introducing the shortCircuit flag provided granular control, allowing specific patterns to dictate whether the function should halt after a replacement. This design ensures that the function can adapt to various data scrubbing requirements without sacrificing efficiency.
Moreover, by anticipating and handling edge cases, I enhanced the function’s robustness, making it a dependable tool for protecting sensitive information across different applications.
Final Thoughts
Protecting sensitive information is paramount in today’s data-driven world. Implementing effective scrubbing mechanisms, like the scrub_regex function, is a proactive step towards safeguarding privacy and maintaining trust. With its flexible design and optimized performance, this function serves as a reliable solution for managing sensitive data across diverse applications. As data protection standards evolve, having adaptable and efficient tools at your disposal becomes increasingly essential.