implementing-cloud-dlp-for-data-protection▌
mukul975/Anthropic-Cybersecurity-Skills · updated May 25, 2026
MDX-style export adds YAML metadata + attribution linking explainx.ai and this canonical listing URL.
Implementing Cloud Data Loss Prevention (DLP) using Amazon Macie, Azure Information Protection, and Google Cloud DLP API to discover, classify, and protect sensitive data across cloud storage, databases, and data pipelines.
| name | implementing-cloud-dlp-for-data-protection |
| description | 'Implementing Cloud Data Loss Prevention (DLP) using Amazon Macie, Azure Information Protection, and Google Cloud DLP API to discover, classify, and protect sensitive data across cloud storage, databases, and data pipelines. ' |
| domain | cybersecurity |
| subdomain | cloud-security |
| tags | - cloud-security - dlp - data-protection - macie - data-classification - privacy |
| version | '1.0' |
| author | mahipal |
| license | Apache-2.0 |
| nist_ai_rmf | - MEASURE-2.7 - MAP-5.1 - MANAGE-2.4 - MEASURE-2.8 - MEASURE-2.9 |
| atlas_techniques | - AML.T0070 - AML.T0066 - AML.T0082 |
| nist_csf | - PR.IR-01 - ID.AM-08 - GV.SC-06 - DE.CM-01 |
Implementing Cloud DLP for Data Protection
When to Use
- When compliance frameworks (GDPR, HIPAA, PCI DSS) require automated sensitive data discovery and protection
- When building data governance programs that classify and label data across cloud storage
- When implementing data loss prevention controls for cloud-based data pipelines
- When auditing cloud environments for unprotected sensitive data (PII, PHI, financial data)
- When integrating DLP scanning into CI/CD pipelines to prevent sensitive data from reaching production
Do not use for endpoint DLP (use Microsoft Purview or Symantec DLP agents), for email DLP (use Microsoft 365 DLP or Google Workspace DLP), or for network-level data exfiltration prevention (use VPC endpoint policies and network firewalls).
Prerequisites
- Amazon Macie enabled with appropriate S3 bucket permissions
- Google Cloud DLP API enabled (
gcloud services enable dlp.googleapis.com) - Azure Information Protection or Microsoft Purview configured
- IAM permissions for DLP service administration and data access
- Knowledge of data sensitivity categories relevant to the organization (PII, PHI, PCI, proprietary)
Workflow
Step 1: Deploy Amazon Macie for S3 Data Discovery
Enable Macie and configure automated sensitive data discovery jobs for S3 buckets.
# Enable Amazon Macie
aws macie2 enable-macie
# List all S3 buckets Macie can scan
aws macie2 describe-buckets \
--query 'buckets[*].[bucketName,classifiableSizeInBytes,unclassifiableObjectCount.total]' \
--output table
# Create a classification job for specific buckets
aws macie2 create-classification-job \
--job-type SCHEDULED \
--name "weekly-pii-scan" \
--schedule-frequency-details '{"weekly":{"dayOfWeek":"MONDAY"}}' \
--s3-job-definition '{
"bucketDefinitions": [{
"accountId": "ACCOUNT_ID",
"buckets": ["customer-data-bucket", "analytics-data-lake", "backup-bucket"]
}],
"scoping": {
"includes": {
"and": [{
"simpleScopeTerm": {
"key": "OBJECT_EXTENSION",
"values": ["csv", "json", "parquet", "txt", "xlsx"],
"comparator": "EQ"
}
}]
}
}
}' \
--managed-data-identifier-ids '["SSN","CREDIT_CARD_NUMBER","EMAIL_ADDRESS","AWS_CREDENTIALS","PHONE_NUMBER"]'
# Create custom data identifier for internal employee IDs
aws macie2 create-custom-data-identifier \
--name "EmployeeID" \
--regex "EMP-[0-9]{6}" \
--description "Internal employee ID format"
# Check job status and results
aws macie2 list-classification-jobs \
--query 'items[*].[name,jobStatus,statistics.approximateNumberOfObjectsToProcess]' \
--output table
Step 2: Configure Google Cloud DLP API for Data Inspection
Use Google Cloud DLP to inspect and de-identify sensitive data across GCP resources.
# Inspect a Cloud Storage bucket for sensitive data
gcloud dlp inspect-content \
--content-type=TEXT_PLAIN \
--min-likelihood=LIKELY \
--info-types=PHONE_NUMBER,EMAIL_ADDRESS,CREDIT_CARD_NUMBER,US_SOCIAL_SECURITY_NUMBER \
--storage-type=CLOUD_STORAGE \
--gcs-uri="gs://sensitive-data-bucket/data/*.csv"
# Create an inspection job for BigQuery
cat > dlp-job.json << 'EOF'
{
"inspectJob": {
"storageConfig": {
"bigQueryOptions": {
"tableReference": {
"projectId": "PROJECT_ID",
"datasetId": "customer_data",
"tableId": "transactions"
},
"sampleMethod": "RANDOM_START",
"rowsLimit": 10000
}
},
"inspectConfig": {
"infoTypes": [
{"name": "CREDIT_CARD_NUMBER"},
{"name": "US_SOCIAL_SECURITY_NUMBER"},
{"name": "EMAIL_ADDRESS"},
{"name": "PHONE_NUMBER"},
{"name": "PERSON_NAME"}
],
"minLikelihood": "LIKELY",
"limits": {"maxFindingsPerRequest": 1000}
},
"actions": [{
"saveFindings": {
"outputConfig": {
"table": {
"projectId": "PROJECT_ID",
"datasetId": "dlp_results",
"tableId": "findings"
}
}
}
}]
}
}
EOF
gcloud dlp jobs create --project=PROJECT_ID --body-from-file=dlp-job.json
Step 3: Implement Data De-identification with Cloud DLP
Configure de-identification transforms to mask, tokenize, or redact sensitive data.
# deidentify_pipeline.py - De-identify sensitive data using Google Cloud DLP
from google.cloud import dlp_v2
def deidentify_data(project_id, text):
"""De-identify PII in text using Cloud DLP."""
client = dlp_v2.DlpServiceClient()
inspect_config = {
"info_types": [
{"name": "EMAIL_ADDRESS"},
{"name": "PHONE_NUMBER"},
{"name": "CREDIT_CARD_NUMBER"},
{"name": "US_SOCIAL_SECURITY_NUMBER"},
],
"min_likelihood": dlp_v2.Likelihood.LIKELY,
}
deidentify_config = {
"info_type_transformations": {
"transformations": [
{
"info_types": [{"name": "EMAIL_ADDRESS"}],
"primitive_transformation": {
"character_mask_config": {
"masking_character": "*",
"number_to_mask": 0,
"characters_to_ignore": [
{"common_characters_to_ignore": "PUNCTUATION"}
],
}
},
},
{
"info_types": [{"name": "CREDIT_CARD_NUMBER"}],
"primitive_transformation": {
"crypto_replace_ffx_fpe_config": {
"crypto_key": {
"kms_wrapped": {
"wrapped_key": "WRAPPED_KEY_BASE64",
"crypto_key_name": "projects/PROJECT/locations/global/keyRings/dlp/cryptoKeys/tokenization",
}
},
"common_alphabet": "NUMERIC",
}
},
},
{
"info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
"primitive_transformation": {
"redact_config": {}
},
},
]
}
}
item = {"value": text}
parent = f"projects/{project_id}/locations/global"
response = client.deidentify_content(
request={
"parent": parent,
"deidentify_config": deidentify_config,
"inspect_config": inspect_config,
"item": item,
}
)
return response.item.value
Step 4: Configure Azure Information Protection
Set up sensitivity labels and DLP policies in Microsoft Purview for Azure resources.
# Connect to Microsoft Purview compliance
Connect-IPPSSession
# Create sensitivity labels
New-Label -DisplayName "Confidential - PII" \
-Name "Confidential-PII" \
-Tooltip "Contains personally identifiable information" \
-ContentType "File, Email"
New-Label -DisplayName "Highly Confidential - Financial" \
-Name "HighlyConfidential-Financial" \
-Tooltip "Contains financial data subject to PCI DSS" \
-ContentType "File, Email"
# Create auto-labeling policy for Azure Storage
New-AutoSensitivityLabelPolicy -Name "Auto-Label-PII" \
-ExchangeLocation All \
-SharePointLocation All \
-OneDriveLocation All \
-Mode Enable
New-AutoSensitivityLabelRule -Policy "Auto-Label-PII" \
-Name "Detect-SSN" \
-ContentContainsSensitiveInformation @{
Name = "U.S. Social Security Number (SSN)";
MinCount = 1;
MinConfidence = 85
} \
-ApplySensitivityLabel "Confidential-PII"
# Azure: Configure DLP policy for Storage accounts
az security assessment create \
--name "storage-sensitive-data" \
--assessed-resource-type "Microsoft.Storage/storageAccounts"
# Enable Microsoft Defender for Storage with sensitive data threat detection
az security pricing create --name StorageAccounts --tier standard \
--subplan DefenderForStorageV2 \
--extensions '[{"name":"SensitiveDataDiscovery","isEnabled":"True"}]'
Step 5: Integrate DLP into Data Pipelines
Add DLP scanning to ETL and data pipeline workflows to prevent sensitive data leakage.
# pipeline_dlp_gate.py - DLP gate for data pipelines
import boto3
import json
macie_client = boto3.client('macie2')
s3_client = boto3.client('s3')
def scan_pipeline_output(bucket, prefix):
"""Scan pipeline output data for sensitive content before promotion."""
job_response = macie_client.create_classification_job(
jobType='ONE_TIME',
name=f'pipeline-scan-{prefix}',
s3JobDefinition={
'bucketDefinitions': [{
'accountId': boto3.client('sts').get_caller_identity()['Account'],
'buckets': [bucket]
}],
'scoping': {
'includes': {
'and': [{
'simpleScopeTerm': {
'key': 'OBJECT_KEY',
'comparator': 'STARTS_WITH',
'values': [prefix]
}
}]
}
}
},
managedDataIdentifierSelector='ALL'
)
return job_response['jobId']
def check_scan_results(job_id):
"""Check if DLP scan found sensitive data."""
response = macie_client.list_findings(
findingCriteria={
'criterion': {
'classificationDetails.jobId': {'eq': [job_id]},
'severity.description': {'eq': ['High', 'Critical']}
}
}
)
return len(response.get('findingIds', [])) > 0
def gate_decision(bucket, prefix):
"""DLP gate: block pipeline if sensitive data found."""
job_id = scan_pipeline_output(bucket, prefix)
has_sensitive_data = check_scan_results(job_id)
if has_sensitive_data:
return {
'decision': 'BLOCK',
'reason': 'Sensitive data detected in pipeline output',
'action': 'Apply de-identification before promoting to production'
}
return {'decision': 'ALLOW', 'reason': 'No sensitive data detected'}
Step 6: Monitor DLP Findings and Generate Reports
Aggregate DLP findings across cloud providers and generate compliance reports.
# Macie: Get finding statistics
aws macie2 get-finding-statistics \
--group-by "severity.description" \
--finding-criteria '{"criterion":{"category":{"eq":["CLASSIFICATION"]}}}'
# Macie: List findings by sensitivity type
aws macie2 list-findings \
--finding-criteria '{
"criterion": {
"classificationDetails.result.sensitiveData.category": {"eq": ["PERSONAL_INFORMATION"]},
"severity.description": {"eq": ["High"]}
}
}' \
--sort-criteria '{"attributeName": "updatedAt", "orderBy": "DESC"}'
# GCP DLP: List job results
gcloud dlp jobs list --project=PROJECT_ID --filter="state=DONE" \
--format="table(name, createTime, inspectDetails.result.processedBytes, inspectDetails.result.totalEstimatedTransformations)"
# Export Macie findings to S3 for compliance reporting
aws macie2 create-findings-report \
--finding-criteria '{"criterion":{"category":{"eq":["CLASSIFICATION"]}}}' \
--sort-criteria '{"attributeName":"severity.score","orderBy":"DESC"}'
Key Concepts
| Term | Definition |
|---|---|
| Data Loss Prevention | Security controls and technologies that detect and prevent unauthorized disclosure of sensitive data from cloud environments |
| Amazon Macie | AWS service using machine learning to discover, classify, and protect sensitive data stored in S3 buckets |
| Google Cloud DLP | GCP API for inspecting, classifying, and de-identifying sensitive data across Cloud Storage, BigQuery, and Datastore |
| Data De-identification | Transforming sensitive data using masking, tokenization, encryption, or redaction to remove identifying characteristics while preserving utility |
| Sensitivity Label | Classification tag applied to data (Confidential, Highly Confidential) that triggers DLP policy enforcement and access controls |
| Custom Data Identifier | Organization-specific pattern (regex or keyword) added to DLP services to detect proprietary sensitive data formats |
Tools & Systems
- Amazon Macie: ML-powered sensitive data discovery and classification for S3 with automated finding generation
- Google Cloud DLP API: Programmable API for inspecting, classifying, de-identifying, and redacting sensitive data
- Microsoft Purview: Data governance platform with sensitivity labeling, auto-classification, and DLP policy enforcement
- Azure Information Protection: Data classification and labeling service integrated with Microsoft 365 and Azure storage
- Nightfall AI: Third-party cloud DLP tool supporting scanning across SaaS applications and cloud infrastructure
Common Scenarios
Scenario: Discovering PII in an Unprotected S3 Data Lake
Context: A compliance audit reveals that the analytics team's S3 data lake contains customer PII (names, emails, SSNs) in CSV files without encryption or access controls. The organization must classify all data and implement DLP controls.
Approach:
- Enable Macie and create a one-time classification job against the data lake bucket
- Review Macie findings to identify which objects contain PII and what types
- Create custom data identifiers for organization-specific formats (employee IDs, account numbers)
- Implement a weekly scheduled Macie job for ongoing discovery
- Build a data pipeline gate that scans new data before promotion to the data lake
- Apply de-identification transforms (masking SSNs, tokenizing emails) for analytics use cases
- Configure S3 bucket policies to restrict access to classified data to authorized roles only
Pitfalls: Macie charges per GB scanned. Large data lakes can generate significant costs. Use scoping rules to focus on high-risk object types (CSV, JSON, Parquet) and exclude known-safe formats (compressed archives, binary files). De-identification must preserve data utility for analytics while removing re-identification risk.
Output Format
Cloud DLP Compliance Report
==============================
Organization: Acme Corp
Scan Period: 2026-02-01 to 2026-02-23
Environments: AWS (12 buckets), GCP (3 datasets), Azure (5 storage accounts)
DATA DISCOVERY SUMMARY:
Total objects/records scanned: 2,847,000
Objects with sensitive data: 45,200 (1.6%)
Unique sensitivity categories: 8
SENSITIVE DATA FINDINGS:
PII (names, emails, phone): 23,400 objects
Financial (credit cards, bank): 8,700 objects
Health (PHI, medical records): 3,200 objects
Credentials (API keys, tokens): 1,400 objects
Government ID (SSN, passport): 5,800 objects
Custom (employee ID, account): 2,700 objects
FINDINGS BY SEVERITY:
Critical: 1,400 (exposed credentials)
High: 14,200 (unprotected PII/PHI)
Medium: 18,600 (standard PII)
Low: 11,000 (non-sensitive patterns)
PROTECTION STATUS:
Data with encryption at rest: 78%
Data with access controls: 65%
Data with sensitivity labels: 12%
Pipeline data with DLP gates: 30%
REMEDIATION ACTIONS:
Objects quarantined: 1,400
De-identification applied: 8,200
Access controls tightened: 14,200
Sensitivity labels applied: 45,200
How to use implementing-cloud-dlp-for-data-protection on Cursor
AI-first code editor with Composer
Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- ›Cursor installed and configured on your development machine
- ›Node.js version 16.0+ with npm package manager (verify with
node --version) - ›Active project directory or workspace where you want to add implementing-cloud-dlp-for-data-protection
Execute installation command
Execute the skills CLI command in your project's root directory to begin installation:
The skills CLI fetches implementing-cloud-dlp-for-data-protection from GitHub repository mukul975/Anthropic-Cybersecurity-Skills and configures it for Cursor.
Select Cursor when prompted
The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:
Verify installation
Confirm successful installation by checking the skill directory location:
Reload or restart Cursor to activate implementing-cloud-dlp-for-data-protection. Access the skill through slash commands (e.g., /implementing-cloud-dlp-for-data-protection) or your agent's skill management interface.
Security & Verification Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.
List & Monetize Your Skill
Submit your Claude Code skill and start earning
Use Cases▌
Exploratory Data Analysis
Quickly understand datasets, identify patterns, and generate insights
Example
Analyze CSV with 100K rows, identify outliers, visualize correlations, suggest hypotheses
Reduce EDA time from hours to minutes, uncover insights faster
Data Cleaning & Transformation
Write scripts to clean messy data, handle missing values, normalize formats
Example
Generate Python/SQL to fix date formats, impute missing values, remove duplicates
Automate 80% of data preprocessing work
Statistical Analysis
Perform hypothesis testing, regression, and statistical modeling
Example
Run A/B test analysis, calculate confidence intervals, interpret p-values
Get statistically sound analysis without PhD in statistics
Data Visualization
Create charts, dashboards, and visual reports
Example
Generate matplotlib/seaborn code for time series plots, distribution charts, heatmaps
Build presentation-ready visualizations 3x faster
Implementation Guide▌
Prerequisites
- ›Claude Desktop or compatible AI client
- ›Python environment (pandas, numpy, matplotlib) or SQL database access
- ›Basic understanding of data analysis concepts
- ›Sample datasets for testing skill capabilities
Time Estimate
20-40 minutes to set up and run first analysis
Installation Steps
- 1.Install data analysis skill using provided command
- 2.Prepare a sample dataset (CSV, JSON, or database connection)
- 3.Start with descriptive statistics: 'Summarize this dataset'
- 4.Progress to visualization: 'Create a scatter plot of X vs Y'
- 5.Advanced analysis: 'Run linear regression and interpret results'
- 6.Validate outputs: check calculations, verify visualizations make sense
- 7.Document analysis workflow for reproducibility
Common Pitfalls
- ⚠Not validating statistical assumptions before applying tests
- ⚠Accepting visualizations without checking data accuracy
- ⚠Overlooking data quality issues (missing values, outliers)
- ⚠Misinterpreting correlation as causation
- ⚠Using wrong statistical test for data distribution
- ⚠Not considering sample size and statistical power
Best Practices▌
✓ Do
- +Always validate data quality before analysis
- +Check statistical assumptions (normality, independence, etc.)
- +Visualize data before running statistical tests
- +Document analysis steps for reproducibility
- +Cross-validate findings with domain experts
- +Use skill for initial exploration, then dive deeper manually
- +Save generated code for reuse on similar datasets
✗ Don't
- −Don't trust analysis without verifying data quality
- −Don't apply statistical tests without checking assumptions
- −Don't make business decisions solely on AI-generated analysis
- −Don't ignore outliers without investigating cause
- −Don't skip data validation and sanity checks
- −Don't use for mission-critical financial or medical analysis without expert review
💡 Pro Tips
- ★Describe data context: 'This is user behavior data from e-commerce site'
- ★Ask for interpretation: 'What does this correlation mean for business?'
- ★Request multiple approaches: 'Show 3 ways to handle missing data'
- ★Combine AI analysis with domain expertise for best insights
- ★Use for rapid prototyping, then refine analysis manually
When to Use This▌
✓ Use When
Use for exploratory data analysis, data cleaning, statistical testing, visualization prototyping, and learning new analysis techniques. Best for initial exploration and rapid insights.
✗ Avoid When
Avoid for mission-critical financial analysis, medical research requiring regulatory compliance, production ML models, or when deep statistical expertise is required for nuanced interpretation.
Learning Path▌
- 1Basic: descriptive statistics, data cleaning, simple visualizations
- 2Intermediate: hypothesis testing, regression, correlation analysis
- 3Advanced: time series analysis, clustering, predictive modeling
- 4Expert: causal inference, experimental design, advanced statistical methods
Discussion
Product Hunt–style comments (not star reviews)- No comments yet — start the thread.
Ratings
4.6★★★★★30 reviews- ★★★★★Hana Jain· Dec 24, 2024
I recommend implementing-cloud-dlp-for-data-protection for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.
- ★★★★★Dhruvi Jain· Dec 16, 2024
Keeps context tight: implementing-cloud-dlp-for-data-protection is the kind of skill you can hand to a new teammate without a long onboarding doc.
- ★★★★★Nikhil Ghosh· Nov 23, 2024
We added implementing-cloud-dlp-for-data-protection from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.
- ★★★★★Neel Lopez· Nov 15, 2024
Solid pick for teams standardizing on skills: implementing-cloud-dlp-for-data-protection is focused, and the summary matches what you get after install.
- ★★★★★Oshnikdeep· Nov 7, 2024
Registry listing for implementing-cloud-dlp-for-data-protection matched our evaluation — installs cleanly and behaves as described in the markdown.
- ★★★★★Ganesh Mohane· Oct 26, 2024
implementing-cloud-dlp-for-data-protection reduced setup friction for our internal harness; good balance of opinion and flexibility.
- ★★★★★Neel Diallo· Oct 14, 2024
implementing-cloud-dlp-for-data-protection fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.
- ★★★★★Nikhil Gill· Oct 6, 2024
implementing-cloud-dlp-for-data-protection has been reliable in day-to-day use. Documentation quality is above average for community skills.
- ★★★★★Chinedu Robinson· Sep 1, 2024
I recommend implementing-cloud-dlp-for-data-protection for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.
- ★★★★★Emma Choi· Aug 20, 2024
Useful defaults in implementing-cloud-dlp-for-data-protection — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.
showing 1-10 of 30