tiledbvcf
K-Dense-AI/scientific-agent-skills · updated Jun 4, 2026
### Tiledbvcf
| name | tiledbvcf |
| description | Efficient storage and retrieval of genomic variant data using TileDB. Scalable VCF/BCF ingestion, incremental sample addition, compressed storage, parallel queries, and export capabilities for population genomics. |
| license | MIT license |
| metadata | version: "1.0" skill-author: Jeremy Leipzig |
TileDB-VCF
Overview
TileDB-VCF is a high-performance C++ library with Python and CLI interfaces for efficient storage and retrieval of genomic variant-call data. Built on TileDB's sparse array technology, it enables scalable ingestion of VCF/BCF files, incremental sample addition without expensive merging operations, and efficient parallel queries of variant data stored locally or in the cloud.
When to Use This Skill
This skill should be used when you are:
- Learning TileDB-VCF concepts and workflows
- Prototyping genomics analyses and pipelines
- Working with small-to-medium datasets (< 1000 samples)
- Adding new samples incrementally to existing datasets
- Querying specific genomic regions across many samples
- Working with cloud-stored variant data (S3, Azure, GCS)
- Exporting subsets of large VCF datasets
- Building variant databases for cohort studies
- Running educational projects or developing methods
- Handling variant data operations where performance is critical
Quick Start
Installation
Preferred Method: Conda/Mamba
# Enter the following two lines if you are on an M1 Mac
CONDA_SUBDIR=osx-64
conda config --env --set subdir osx-64
# Create the conda environment
conda create -n tiledb-vcf "python<3.10"
conda activate tiledb-vcf
# Mamba is a faster and more reliable alternative to conda
conda install -c conda-forge mamba
# Install TileDB-Py and TileDB-VCF, along with other useful libraries
mamba install -y -c conda-forge -c bioconda -c tiledb tiledb-py tiledbvcf-py pandas pyarrow numpy
Alternative: Docker Images
docker pull tiledb/tiledbvcf-py # Python interface
docker pull tiledb/tiledbvcf-cli # Command-line interface
Basic Examples
Create and populate a dataset:
import tiledbvcf
# Open a new dataset in write mode and create it on disk
ds = tiledbvcf.Dataset(uri="my_dataset", mode="w")
ds.create_dataset()
# Ingest VCF files. Requirements:
# - VCFs must be single-sample (not multi-sample)
# - Each file must have an index: .csi (bcftools) or .tbi (tabix)
ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])
Query variant data:
# Open existing dataset for reading
ds = tiledbvcf.Dataset(uri="my_dataset", mode="r")
# Query specific regions and samples
df = ds.read(
attrs=["sample_name", "pos_start", "pos_end", "alleles", "fmt_GT"],
regions=["chr1:1000000-2000000", "chr2:500000-1500000"],
samples=["sample1", "sample2", "sample3"]
)
print(df.head())
Export to VCF:
import os
# Export two VCF samples
ds.export(
regions=["chr21:8220186-8405573"],
samples=["HG00101", "HG00097"],
output_format="v",
output_dir=os.path.expanduser("~"),
)
Core Capabilities
1. Dataset Creation and Ingestion
Create TileDB-VCF datasets and incrementally ingest variant data from multiple VCF/BCF files. This is appropriate for building population genomics databases and cohort studies.
Requirements:
- Single-sample VCFs only: Multi-sample VCFs are not supported
- Index files required: VCF/BCF files must have indexes (.csi or .tbi)
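The index requirement can be checked before ingestion starts. The following is a minimal sketch, assuming local files and that each index sits next to its VCF/BCF with `.csi` or `.tbi` appended to the file name. The helper name `missing_indexes` is hypothetical and not part of TileDB-VCF.

```python
from pathlib import Path

INDEX_SUFFIXES = (".csi", ".tbi")  # index extensions named in the requirements above


def missing_indexes(vcf_paths):
    """Return the VCF/BCF paths that have no .csi or .tbi index next to them."""
    missing = []
    for path in map(Path, vcf_paths):
        # An index is expected at "<file>.csi" or "<file>.tbi" (assumption)
        has_index = any(Path(str(path) + suffix).exists() for suffix in INDEX_SUFFIXES)
        if not has_index:
            missing.append(str(path))
    return missing
```

Files reported as missing can be indexed with `bcftools index` or `tabix -p vcf` before they are passed to ingestion.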
Common operations:
- Create new datasets with optimized array schemas
- Ingest single or multiple VCF/BCF files in parallel
- Add new samples incrementally without re-processing existing data
- Configure memory usage and compression settings
- Handle various VCF formats and INFO/FORMAT fields
- Resume interrupted ingestion processes
- Validate data integrity during ingestion
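Incremental ingestion can be organised as a sequence of small batches, so that each call handles a bounded number of files. This is a sketch only: `batches` and `ingest_in_batches` are hypothetical helpers, and the only TileDB-VCF call they rely on is `ingest_samples()`, used the same way as in the Quick Start.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def ingest_in_batches(ds, vcf_uris, batch_size=10):
    """Ingest `vcf_uris` one batch at a time and return the number of files passed on.

    `ds` is expected to be a tiledbvcf.Dataset opened with mode="w";
    only its ingest_samples() method is used here.
    """
    ingested = 0
    for batch in batches(list(vcf_uris), batch_size):
        ds.ingest_samples(batch)
        ingested += len(batch)
    return ingested
```

A suitable `batch_size` depends on file sizes and available memory, so the default of 10 is only a placeholder.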
2. Efficient Querying and Filtering
Query variant data with high performance across genomic regions, samples, and variant attributes. This is appropriate for association studies, variant discovery, and population analysis.
Common operations:
- Query specific genomic regions (single or multiple)
- Filter by sample names or sample groups
- Extract specific variant attributes (position, alleles, genotypes, quality)
- Access INFO and FORMAT fields efficiently
- Combine spatial and attribute-based filtering
- Stream large query results
- Perform aggregations across samples or regions
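Large results can be consumed piece by piece rather than held in memory at once. The sketch below assumes the installed tiledbvcf version exposes `read_completed()` and `continue_read()` on `Dataset` for queries that do not fit in the memory budget; the generator name `read_in_chunks` is hypothetical.

```python
def read_in_chunks(ds, **read_kwargs):
    """Yield each partial result of a query until the read is complete.

    `ds` is expected to be a tiledbvcf.Dataset opened with mode="r".
    `read_kwargs` are forwarded unchanged to ds.read() (attrs, regions, samples).
    """
    yield ds.read(**read_kwargs)
    # Keep fetching while the dataset reports that results remain (assumed API)
    while not ds.read_completed():
        yield ds.continue_read()
```

Each yielded chunk can be filtered or aggregated and then discarded, which keeps peak memory close to the size of one chunk.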
3. Data Export and Interoperability
Export data in various formats for downstream analysis or integration with other genomics tools. This is appropriate for sharing datasets, creating analysis subsets, or feeding other pipelines.
Common operations:
- Export to standard VCF/BCF formats
- Generate TSV files with selected fields
- Create sample/region-specific subsets
- Maintain data provenance and metadata
- Lossless data export preserving all annotations
- Compressed output formats
- Streaming exports for large datasets
4. Population Genomics Workflows
TileDB-VCF excels at large-scale population genomics analyses requiring efficient access to variant data across many samples and genomic regions.
Common workflows:
- Genome-wide association studies (GWAS) data preparation
- Rare variant burden testing
- Population stratification analysis
- Allele frequency calculations across populations
- Quality control across large cohorts
- Variant annotation and filtering
- Cross-population comparative analysis
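As an illustration of the allele frequency workflow, the arithmetic for a single site can be sketched in plain Python. This assumes genotype calls arrive as sequences of allele indexes (0 for reference, positive for alternate) and that negative values mark missing calls; both are assumptions about the query output, and `alt_allele_frequency` is a hypothetical helper.

```python
from collections import Counter


def alt_allele_frequency(genotypes):
    """Return the fraction of called alleles that are non-reference.

    `genotypes` is an iterable of per-sample genotype calls at one site,
    each a sequence of allele indexes (0 = reference, >0 = alternate).
    Negative values are treated as missing calls and skipped (assumption).
    """
    counts = Counter()
    for call in genotypes:
        for allele in call:
            if allele >= 0:
                counts["alt" if allele > 0 else "ref"] += 1
    called = counts["ref"] + counts["alt"]
    return counts["alt"] / called if called else float("nan")
```

Grouping calls by population label before applying the function gives per-population frequencies for comparative analysis.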
Key Concepts
Array Schema and Data Model
TileDB-VCF Data Model:
- Variants stored as sparse arrays with genomic coordinates as dimensions
- Samples indexed as an array dimension alongside genomic position, allowing efficient sample-specific queries
- INFO and FORMAT fields preserved with original data types
- Automatic compression and chunking for optimal storage
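Read Configuration:
# Read-time settings (these do not change the array schema)
config = tiledbvcf.ReadConfig(
    memory_budget_mb=2048,     # memory budget in MB
    region_partition=(0, 4),   # (partition index, number of partitions): first of 4 region partitions
    sample_partition=(0, 2),   # (partition index, number of partitions): first of 2 sample partitions
)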
Coordinate Systems and Regions
Critical: TileDB-VCF uses 1-based genomic coordinates, following the VCF standard:
- Positions are 1-based (first base is position 1)
- Ranges are inclusive on both ends
- Region "chr1:1000-2000" includes positions 1000-2000 (1001 bases total)
Region specification formats:
# Single region
regions = ["chr1:1000000-2000000"]
# Multiple regions
regions = ["chr1:1000000-2000000", "chr2:500000-1500000"]
# Whole chromosome
regions = ["chr1"]
# From a BED interval (0-based, half-open): add 1 to the start, keep the end
# BED "chr1 999999 2000000" corresponds to the 1-based, inclusive region below
regions = ["chr1:1000000-2000000"]
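The coordinate arithmetic is easy to get wrong by one base, so a small sketch may help. It converts a 0-based, half-open BED interval to a 1-based, inclusive region string and computes the number of bases a region covers. Both helper names are hypothetical.

```python
def bed_to_region(contig, bed_start, bed_end):
    """Convert a 0-based, half-open BED interval to a 1-based, inclusive region string."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("BED interval must satisfy 0 <= start < end")
    # Only the start shifts by one; the half-open end equals the inclusive end
    return f"{contig}:{bed_start + 1}-{bed_end}"


def region_length(region):
    """Number of bases covered by a 'contig:start-end' region (both ends inclusive)."""
    _, span = region.rsplit(":", 1)
    start, end = (int(x) for x in span.split("-"))
    return end - start + 1
```

For example, `bed_to_region("chr1", 999999, 2000000)` gives `"chr1:1000000-2000000"`, and `region_length("chr1:1000-2000")` gives 1001, matching the count stated above.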
Memory Management
Performance considerations:
- Set appropriate memory budget based on available system memory
- Use streaming queries for very large result sets
- Partition large ingestions to avoid memory exhaustion
- Configure tile cache for repeated region access
- Use parallel ingestion for multiple files
- Optimize region queries by combining nearby regions
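Combining nearby regions, the last point in the list, can be sketched as a simple interval merge. The helper `merge_nearby` and its `max_gap` parameter are hypothetical; coordinates are assumed to be 1-based and inclusive, as elsewhere in this document.

```python
def merge_nearby(regions, max_gap=0):
    """Merge (contig, start, end) regions separated by at most `max_gap` bases.

    Coordinates are 1-based and inclusive. Returns 'contig:start-end' strings
    sorted by contig name and start position.
    """
    merged = []
    for contig, start, end in sorted(regions):
        previous = merged[-1] if merged else None
        # Gap is the number of bases strictly between the two intervals
        if previous and previous[0] == contig and start - previous[2] - 1 <= max_gap:
            previous[2] = max(previous[2], end)
        else:
            merged.append([contig, start, end])
    return [f"{c}:{s}-{e}" for c, s, e in merged]
```

With a positive `max_gap`, the merged query also returns variants that fall in the gaps, so results may need filtering back to the original intervals.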
Cloud Storage Integration
TileDB-VCF seamlessly works with cloud storage:
# S3 dataset
ds = tiledbvcf.Dataset(uri="s3://bucket/dataset", mode="r")
# Azure Blob Storage
ds = tiledbvcf.Dataset(uri="azure://container/dataset", mode="r")
# Google Cloud Storage
ds = tiledbvcf.Dataset(uri="gcs://bucket/dataset", mode="r")
Common Pitfalls
- Memory exhaustion during ingestion: Use appropriate memory budget and batch processing for large VCF files
- Inefficient region queries: Combine nearby regions instead of many separate queries
- Missing sample names: Ensure sample names in VCF headers match query sample specifications
- Coordinate system confusion: Remember TileDB-VCF uses 1-based coordinates like VCF standard
- Large result sets: Use streaming or pagination for queries returning millions of variants
- Cloud permissions: Ensure proper authentication for cloud storage access
- Concurrent access: Multiple writers to the same dataset can cause corruption—use appropriate locking
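The sample-name pitfall can be caught before a query runs by comparing the requested names with those stored in the dataset. A minimal sketch, where `split_known_samples` is a hypothetical helper and the list of available names would typically come from `ds.samples()`:

```python
def split_known_samples(requested, available):
    """Split `requested` sample names into (found, missing) against `available`.

    Order of `requested` is preserved; comparison is exact and case-sensitive.
    """
    available = set(available)
    found = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    return found, missing
```

Reporting the `missing` list up front makes it clear when an empty result comes from a naming mismatch rather than from an absence of variants.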
CLI Usage
TileDB-VCF provides a command-line interface with the following subcommands:
Available Subcommands:
- create: Creates an empty TileDB-VCF dataset
- store: Ingests samples into a TileDB-VCF dataset
- export: Exports data from a TileDB-VCF dataset
- list: Lists all sample names present in a TileDB-VCF dataset
- stat: Prints high-level statistics about a TileDB-VCF dataset
- utils: Utils for working with a TileDB-VCF dataset
- version: Print the version information and exit
# Create empty dataset
tiledbvcf create --uri my_dataset
# Ingest samples (requires single-sample VCFs with indexes); file paths are positional arguments
tiledbvcf store --uri my_dataset sample1.vcf.gz sample2.vcf.gz
# Export data
tiledbvcf export --uri my_dataset \
--regions "chr1:1000000-2000000" \
--sample-names "sample1,sample2"
# List all samples
tiledbvcf list --uri my_dataset
# Show dataset statistics
tiledbvcf stat --uri my_dataset
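When the CLI is driven from a script, building the argument list in code avoids shell-quoting mistakes. The sketch below only assembles the `export` command with the flags shown above; `export_command` is a hypothetical helper, and actually running the command requires `tiledbvcf` on the PATH.

```python
import shlex


def export_command(uri, regions, sample_names):
    """Build the argument list for `tiledbvcf export` using the flags shown above."""
    return [
        "tiledbvcf", "export",
        "--uri", uri,
        "--regions", ",".join(regions),
        "--sample-names", ",".join(sample_names),
    ]


cmd = export_command("my_dataset", ["chr1:1000000-2000000"], ["sample1", "sample2"])
print(shlex.join(cmd))  # shell-quoted form, useful for logging
# To execute: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` bypasses the shell, so region strings and sample names need no extra escaping.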
Advanced Features
Allele Frequency Analysis
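# Calculate allele frequencies (the function's module path and arguments may differ by tiledbvcf version; check the installed API)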
af_df = tiledbvcf.read_allele_frequency(
uri="my_dataset",
regions=["chr1:1000000-2000000"],
samples=["sample1", "sample2", "sample3"]
)
Sample Quality Control
# Perform sample QC
qc_results = tiledbvcf.sample_qc(
uri="my_dataset",
samples=["sample1", "sample2"]
)
Custom Configurations
# Advanced configuration (memory budget is given in MB)
config = tiledbvcf.ReadConfig(
    memory_budget_mb=4096,
tiledb_config={
"sm.tile_cache_size": "1000000000",
"vfs.s3.region": "us-east-1"
}
)
Resources
Getting Help
Open Source TileDB-VCF Resources
Open Source Documentation:
- TileDB Academy: https://cloud.tiledb.com/academy/
- Population Genomics Guide: https://cloud.tiledb.com/academy/structure/life-sciences/population-genomics/
- TileDB-VCF GitHub: https://github.com/TileDB-Inc/TileDB-VCF
TileDB-Cloud Resources
For Large-Scale/Production Genomics:
- TileDB-Cloud Platform: https://cloud.tiledb.com
- TileDB Academy (All Documentation): https://cloud.tiledb.com/academy/
Getting Started:
- Free account signup: https://cloud.tiledb.com
- Contact: [email protected] for enterprise needs
Scaling to TileDB-Cloud
When your genomics workloads outgrow single-node processing, TileDB-Cloud provides enterprise-scale capabilities for production genomics pipelines.
Note: This section covers TileDB-Cloud capabilities based on available documentation. For complete API details and current functionality, consult the official TileDB-Cloud documentation and API reference.
Setting Up TileDB-Cloud
1. Create Account and Get API Token
# Sign up at https://cloud.tiledb.com
# Generate API token in your account settings
2. Install TileDB-Cloud Python Client
# Base installation
pip install tiledb-cloud
# With genomics-specific functionality
pip install "tiledb-cloud[life-sciences]"
3. Configure Authentication
# Set environment variable with your API token
export TILEDB_REST_TOKEN="your_api_token"
import tiledb.cloud
# Authentication is automatic via TILEDB_REST_TOKEN
# No explicit login required in code
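Because authentication is implicit, a missing token tends to surface later as a less obvious error. A small guard at the start of a script, sketched here with a hypothetical `rest_token` helper, fails early with a clear message:

```python
import os


def rest_token(environ=os.environ):
    """Return the TileDB-Cloud API token from the environment, or raise a clear error."""
    token = environ.get("TILEDB_REST_TOKEN", "").strip()
    if not token:
        raise RuntimeError("TILEDB_REST_TOKEN is not set; export it before using tiledb.cloud")
    return token
```

The `environ` parameter exists only to make the function easy to test with a plain dictionary.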
Migrating from Open Source to TileDB-Cloud
Large-Scale Ingestion
# TileDB-Cloud: Distributed VCF ingestion
import tiledb.cloud.vcf
# Use specialized VCF ingestion module
# Note: Exact API requires TileDB-Cloud documentation
# This represents the available functionality structure
tiledb.cloud.vcf.ingestion.ingest_vcf_dataset(
source="s3://my-bucket/vcf-files/",
output="tiledb://my-namespace/large-dataset",
namespace="my-namespace",
acn="my-s3-credentials",
ingest_resources={"cpu": "16", "memory": "64Gi"}
)
Distributed Query Processing
# TileDB-Cloud: VCF querying across distributed storage
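import os
import tiledb.cloud.vcf
import tiledbvcf
# Define the dataset URI
dataset_uri = "tiledb://TileDB-Inc/gvcf-1kg-dragen-v376"
# TileDB configuration carrying the API token (assumes TILEDB_REST_TOKEN is set)
cfg = {"rest.token": os.environ["TILEDB_REST_TOKEN"]}
# Get all samples from the dataset
ds = tiledbvcf.Dataset(dataset_uri, tiledb_config=cfg)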
samples = ds.samples()
# Define attributes and ranges to query on
attrs = ["sample_name", "fmt_GT", "fmt_AD", "fmt_DP"]
regions = ["chr13:32396898-32397044", "chr13:32398162-32400268"]
# Perform the read, which is executed in a distributed fashion
df = tiledb.cloud.vcf.read(
dataset_uri=dataset_uri,
regions=regions,
samples=samples,
attrs=attrs,
namespace="my-namespace", # specifies which account to charge
)
df.to_pandas()
Enterprise Features
Data Sharing and Collaboration
# TileDB-Cloud provides enterprise data sharing capabilities
# through namespace-based permissions and group management
# Access shared datasets via TileDB-Cloud URIs
dataset_uri = "tiledb://shared-namespace/population-study"
# Collaborate through shared notebooks and compute resources
# (Specific API requires TileDB-Cloud documentation)
Cost Optimization
- Serverless Compute: Pay only for actual compute time
- Auto-scaling: Automatically scale up/down based on workload
- Spot Instances: Use cost-optimized compute for batch jobs
- Data Tiering: Automatic hot/cold storage management
Security and Compliance
- End-to-end Encryption: Data encrypted in transit and at rest
- Access Controls: Fine-grained permissions and audit logs
- HIPAA/SOC2 Compliance: Enterprise security standards
- VPC Support: Deploy in private cloud environments
When to Migrate Checklist
✅ Migrate to TileDB-Cloud if you have:
- Datasets > 1000 samples
- Need to process > 100GB of VCF data
- Require distributed computing
- Multiple team members need access
- Need enterprise security/compliance
- Want cost-optimized serverless compute
- Require 24/7 production uptime
Getting Started with TileDB-Cloud
- Start Free: TileDB-Cloud offers a free tier for evaluation
- Migration Support: TileDB team provides migration assistance
- Training: Access to genomics-specific tutorials and examples
- Professional Services: Custom deployment and optimization
Next Steps:
- Visit https://cloud.tiledb.com to create an account
- Review documentation at https://cloud.tiledb.com/academy/
- Contact [email protected] for enterprise needs