polars-bio

K-Dense-AI/scientific-agent-skills · updated Jun 4, 2026


$ npx skills add https://github.com/K-Dense-AI/scientific-agent-skills --skill polars-bio

skill.md

  • name: polars-bio
  • description: High-performance genomic interval operations and bioinformatics file I/O on Polars DataFrames. Overlap, nearest, merge, coverage, complement, subtract for BED/VCF/BAM/GFF intervals. Streaming, cloud-native, faster bioframe alternative.
  • license: Apache-2.0
  • allowed-tools: Read Write Edit Bash
  • compatibility: Requires Python 3.11–3.14 and polars-bio (uv pip install). Cloud I/O uses standard AWS/GCS/Azure SDK env vars when paths use s3://, gs://, or az:// URIs.
  • metadata: version: "1.0"; skill-author: K-Dense Inc.

polars-bio

Overview

polars-bio is a high-performance Python library for genomic interval operations and bioinformatics file I/O, built on Polars, Apache Arrow, and Apache DataFusion. It provides a familiar DataFrame-centric API for interval arithmetic (overlap, nearest, merge, coverage, complement, subtract) and reading/writing common bioinformatics formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ).

Key value propositions:

  • 6-38x faster than bioframe on real-world genomic benchmarks
  • Streaming/out-of-core support for large genomes via DataFusion
  • Cloud-native file I/O (S3, GCS, Azure) with predicate pushdown
  • Two API styles: functional (pb.overlap(df1, df2)) and method-chaining (df1.lazy().pb.overlap(df2))
  • SQL interface for genomic data via DataFusion SQL engine

When to Use This Skill

Use this skill when:

  • Performing genomic interval operations (overlap, nearest, merge, coverage, complement, subtract)
  • Reading/writing bioinformatics file formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ)
  • Processing large genomic datasets that don't fit in memory (streaming mode)
  • Running SQL queries on genomic data files
  • Migrating from bioframe to a faster alternative
  • Computing read depth/pileup from BAM/CRAM files
  • Working with Polars DataFrames containing genomic intervals

Quick Start

Installation

Requires Python 3.11–3.14 (see PyPI).

uv pip install "polars-bio==0.31.0"

For pandas compatibility (pandas ≥3.0):

uv pip install "polars-bio[pandas]==0.31.0"

Basic Overlap Example

import polars as pl
import polars_bio as pb

# Create two interval DataFrames
df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [1, 5, 22],
    "end":   [6, 9, 30],
})

df2 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [3, 25],
    "end":   [8, 28],
})

# Functional API (returns LazyFrame by default)
result = pb.overlap(df1, df2)
result_df = result.collect()

# Get a DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")

# Method-chaining API (via .pb accessor on LazyFrame)
result = df1.lazy().pb.overlap(df2)
result_df = result.collect()

Reading a BED File

import polars_bio as pb

# Eager read (loads entire file)
df = pb.read_bed("regions.bed")

# Lazy scan (streaming, for large files)
lf = pb.scan_bed("regions.bed")
result = lf.collect()

Core Capabilities

1. Genomic Interval Operations

polars-bio provides 8 core interval operations for genomic range arithmetic. All operations accept Polars DataFrames with chrom, start, end columns (configurable). All operations return a LazyFrame by default (use output_type="polars.DataFrame" for eager results).

Operations:

  • overlap / count_overlaps - Find or count overlapping intervals between two sets (overlap_output="left" returns df1-only hits since 0.30.0)
  • nearest - Find nearest intervals (with configurable k, overlap, distance params)
  • merge - Merge overlapping/bookended intervals within a set
  • cluster - Assign cluster IDs to overlapping intervals
  • coverage - Compute per-interval coverage counts (two-input operation)
  • complement - Find gaps between intervals within a genome
  • subtract - Remove portions of intervals that overlap another set

Example:

import polars_bio as pb

# Find overlapping intervals (returns LazyFrame)
result = pb.overlap(df1, df2, suffixes=("_1", "_2"))

# Count overlaps per interval
counts = pb.count_overlaps(df1, df2)

# Merge overlapping intervals
merged = pb.merge(df1)

# Find nearest intervals
nearest = pb.nearest(df1, df2)

# Collect any LazyFrame result to DataFrame
result_df = result.collect()

Reference: See references/interval_operations.md for detailed documentation on all operations, parameters, output schemas, and performance considerations.
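To make the semantics of merge and complement concrete, here is a minimal pure-Python sketch. It is not how polars-bio implements these operations, and it assumes 0-based half-open intervals, where "bookended" means one interval ends exactly where the next starts. The function names and the chrom_sizes argument are hypothetical and exist only for this illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or bookended (chrom, start, end) tuples.

    Assumes 0-based half-open coordinates (illustrative only).
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def complement_intervals(merged, chrom_sizes):
    """Return the gaps not covered by `merged` within each chromosome.

    `chrom_sizes` is a hypothetical {chrom: length} mapping.
    """
    gaps = []
    for chrom, size in chrom_sizes.items():
        cursor = 0
        for c, start, end in merged:
            if c != chrom:
                continue
            if start > cursor:
                gaps.append((chrom, cursor, start))
            cursor = max(cursor, end)
        if cursor < size:
            gaps.append((chrom, cursor, size))
    return gaps


intervals = [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 9, 12), ("chr1", 22, 30)]
merged = merge_intervals(intervals)
gaps = complement_intervals(merged, {"chr1": 40})
```

Here the first three intervals collapse into one because each overlaps or touches its neighbor, and the complement is whatever remains of the 40-base toy chromosome.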

2. Bioinformatics File I/O

Read and write common bioinformatics formats with read_*, scan_*, write_*, and sink_* functions. Supports cloud storage (S3, GCS, Azure) and compression (GZIP, BGZF).

Supported formats:

  • BED - Genomic intervals (read_bed, scan_bed, write_* via generic)
  • VCF - Genetic variants (read_vcf, scan_vcf, write_vcf, sink_vcf)
  • VCF Zarr - Analysis-ready Zarr stores (read_vcf_zarr, scan_vcf_zarr; local directory paths)
  • BAM - Aligned reads (read_bam, scan_bam, write_bam, sink_bam)
  • CRAM - Compressed alignments (read_cram, scan_cram, write_cram, sink_cram)
  • GFF - Gene annotations (read_gff, scan_gff)
  • GTF - Gene annotations (read_gtf, scan_gtf)
  • FASTA - Reference sequences (read_fasta, scan_fasta, write_fasta, sink_fasta)
  • FASTQ - Sequencing reads (read_fastq, scan_fastq, write_fastq, sink_fastq)
  • SAM - Text alignments (read_sam, scan_sam, write_sam, sink_sam)
  • Hi-C pairs - Chromatin contacts (read_pairs, scan_pairs)

Example:

import polars_bio as pb

# Read VCF file
variants = pb.read_vcf("samples.vcf.gz")

# Lazy scan BAM file (streaming)
alignments = pb.scan_bam("aligned.bam")

# Read GFF annotations
genes = pb.read_gff("annotations.gff3")

# Cloud storage (individual params, not a dict)
df = pb.read_bed("s3://bucket/regions.bed",
                 allow_anonymous=True)

Reference: See references/file_io.md for per-format column schemas, parameters, cloud storage options, and compression support.

3. SQL Data Processing

Register bioinformatics files as tables and query them using DataFusion SQL. Combines the power of SQL with polars-bio's genomic-aware readers.

import polars as pl
import polars_bio as pb

# Register files as SQL tables (path first, name= keyword)
pb.register_vcf("samples.vcf.gz", name="variants")
pb.register_bed("target_regions.bed", name="regions")

# Query with SQL (returns LazyFrame); "end" is quoted because END is an SQL keyword
result = pb.sql('SELECT chrom, start, "end", ref, alt FROM variants WHERE qual > 30')
result_df = result.collect()

# Register a Polars DataFrame as a SQL table
df = pl.DataFrame({"chrom": ["chr1"], "start": [1], "end": [6]})
pb.from_polars("my_intervals", df)
result = pb.sql("SELECT * FROM my_intervals WHERE chrom = 'chr1'").collect()

Reference: See references/sql_processing.md for register functions, SQL syntax, and examples.

4. Pileup Operations

Compute per-base read depth from BAM/CRAM files with CIGAR-aware depth calculation.

import polars_bio as pb

# Compute depth across a BAM file
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()

# With quality filter
depth_lf = pb.depth("aligned.bam", min_mapping_quality=20)

Reference: See references/pileup_operations.md for parameters and integration patterns.
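As a rough illustration of what "CIGAR-aware" means, the sketch below computes per-base depth from (position, CIGAR) pairs in pure Python. It is a conceptual model, not the polars-bio implementation. It assumes that only M, = and X operations add depth, while D and N advance along the reference without adding depth. How polars-bio treats deletions may differ, so check references/pileup_operations.md before relying on this.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def depth_from_reads(reads):
    """Per-base depth from (1-based start position, CIGAR string) pairs.

    Assumption for this sketch: only M/=/X contribute depth.
    """
    depth = Counter()
    for pos, cigar in reads:
        ref = pos
        for length, op in CIGAR_RE.findall(cigar):
            n = int(length)
            if op in "M=X":
                # Aligned bases: count each covered reference position.
                for p in range(ref, ref + n):
                    depth[p] += 1
                ref += n
            elif op in "DN":
                # Deletion or skipped region: move along the reference only.
                ref += n
            # I, S, H, P do not consume reference positions.
    return dict(sorted(depth.items()))


reads = [(1, "3M2D2M"), (2, "4M")]
depth = depth_from_reads(reads)
```

The first read covers positions 1 to 3 and 6 to 7, skipping the two deleted bases. A depth calculation that ignored the CIGAR string would instead count a contiguous 5-base block starting at position 1.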

Key Concepts

Coordinate Systems

polars-bio defaults to 1-based coordinates (genomic convention). This can be changed globally:

import polars_bio as pb

# Switch to 0-based half-open coordinates (default is 1-based / False)
pb.set_option("datafusion.bio.coordinate_system_zero_based", True)

# Switch back to 1-based (default)
pb.set_option("datafusion.bio.coordinate_system_zero_based", False)

I/O functions also accept use_zero_based to set coordinate metadata on the resulting DataFrame:

# Read BED with explicit 0-based metadata
df = pb.read_bed("regions.bed", use_zero_based=True)

Important: BED files are always 0-based half-open in the file format. polars-bio handles the conversion automatically when reading BED files. Coordinate metadata is attached to DataFrames by I/O functions and propagated through operations.
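The difference between the two conventions can be shown in plain Python. This sketch does not call polars-bio, and its helper names are made up for illustration. It shows the standard conversion between BED-style 0-based half-open intervals and 1-based closed intervals, and why the same pair of numbers can overlap under one convention but not the other.

```python
def bed_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based closed [start+1, end]."""
    return start + 1, end


def one_based_to_bed(start, end):
    """1-based closed [start, end] -> 0-based half-open [start-1, end)."""
    return start - 1, end


def overlaps_half_open(a, b):
    """Overlap test for 0-based half-open intervals."""
    return a[0] < b[1] and b[0] < a[1]


def overlaps_closed(a, b):
    """Overlap test for 1-based closed intervals."""
    return a[0] <= b[1] and b[0] <= a[1]


# The first 100 bases of a chromosome in both conventions.
first_100 = bed_to_one_based(0, 100)

# The same raw numbers give different answers depending on the convention.
touching_half_open = overlaps_half_open((1, 6), (6, 9))
touching_closed = overlaps_closed((1, 6), (6, 9))
```

Here (1, 6) and (6, 9) do not overlap as half-open intervals but share base 6 as closed intervals, which is why inputs with mismatched coordinate metadata should not be combined.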

Two API Styles

Functional API - standalone functions, explicit inputs:

result = pb.overlap(df1, df2, suffixes=("_1", "_2"))
merged = pb.merge(df)

Method-chaining API - via .pb accessor on LazyFrames (not DataFrames):

result = df1.lazy().pb.overlap(df2)
merged = df.lazy().pb.merge()

Important: The .pb accessor for interval operations is only available on LazyFrame. On DataFrame, .pb provides write operations only (write_bam, write_vcf, etc.).

Method-chaining enables fluent pipelines:

import polars as pl
import polars_bio as pb  # registers the .pb accessor

# Chain interval operations (note: overlap outputs suffixed columns,
# so rename before merge which expects chrom/start/end)
result = (
    df1.lazy()
    .pb.overlap(df2)
    .filter(pl.col("start_2") > 1000)
    .select(
        pl.col("chrom_1").alias("chrom"),
        pl.col("start_1").alias("start"),
        pl.col("end_1").alias("end"),
    )
    .pb.merge()
    .collect()
)

Probe-Build Architecture

For two-input operations (overlap, nearest, count_overlaps, coverage), polars-bio uses a probe-build join strategy:

  • The first DataFrame is the probe (iterated over)
  • The second DataFrame is the build (indexed for lookup)

For best performance, pass the larger DataFrame as the first argument (probe) and the smaller one as the second (build).
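The probe/build split can be pictured with a toy pure-Python join. This is only a conceptual sketch: polars-bio's actual index structure is not described here, and the sorted-list index, function names, and 1-based closed overlap test are assumptions made for illustration. The build side is indexed once per chromosome, and each probe interval is then looked up against that index.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(build):
    """Index the build side: chrom -> list of (start, end) sorted by start."""
    index = defaultdict(list)
    for chrom, start, end in build:
        index[chrom].append((start, end))
    for ivs in index.values():
        ivs.sort()
    return index


def probe_overlaps(probe, index):
    """Iterate over the probe side and look each interval up in the index.

    Uses 1-based closed overlap semantics (an assumption of this sketch).
    """
    pairs = []
    for chrom, start, end in probe:
        candidates = index.get(chrom, [])
        # Only build intervals starting at or before the probe's end can overlap.
        hi = bisect_right(candidates, (end, float("inf")))
        for b_start, b_end in candidates[:hi]:
            if b_end >= start:
                pairs.append(((chrom, start, end), (chrom, b_start, b_end)))
    return pairs


probe = [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)]
build = [("chr1", 3, 8), ("chr1", 25, 28)]
pairs = probe_overlaps(probe, build_index(build))
```

The index is built once and held in memory while the probe side is scanned, which is one intuition for keeping the build side small.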

Column Conventions

By default, polars-bio expects columns named chrom, start, end. Custom column names can be specified via lists:

result = pb.overlap(
    df1, df2,
    cols1=["chromosome", "begin", "finish"],
    cols2=["chr", "pos_start", "pos_end"],
)

Return Types and Collecting Results

All interval operations and pb.sql() return a LazyFrame by default. Use .collect() to materialize results, or pass output_type="polars.DataFrame" for eager evaluation:

# Lazy (default) - collect when needed
result_lf = pb.overlap(df1, df2)
result_df = result_lf.collect()

# Eager - get DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")

Streaming and Out-of-Core Processing

For datasets larger than available RAM, use scan_* functions and streaming execution:

# Scan files lazily
lf = pb.scan_bed("large_intervals.bed")

# Process with Polars streaming (requires polars ≥1.37, bundled with polars-bio)
result = lf.collect(engine="streaming")

DataFusion streaming is enabled by default for interval operations, processing data in batches without loading the full dataset into memory.

Common Pitfalls

  1. .pb accessor on DataFrame vs LazyFrame: Interval operations (overlap, merge, etc.) are only on LazyFrame.pb. DataFrame.pb only has write methods. Use .lazy() to convert before chaining interval ops.

  2. LazyFrame returns: All interval operations and pb.sql() return LazyFrame by default. Don't forget .collect() or use output_type="polars.DataFrame".

  3. Column name mismatches: polars-bio expects chrom, start, end by default. Use cols1/cols2 parameters (as lists) if your columns have different names.

  4. Coordinate system metadata: Interval operations read coordinate metadata from I/O functions or DataFrame config_meta. For manually built DataFrames, set df.config_meta.set(coordinate_system_zero_based=True) (0-based) or False (1-based). If metadata is missing, polars-bio falls back to the global datafusion.bio.coordinate_system_zero_based setting (with a warning). Set pb.set_option("datafusion.bio.coordinate_system_check", True) to raise MissingCoordinateSystemError instead. Mismatched systems between inputs raise CoordinateSystemMismatchError.

  5. Probe-build order matters: For overlap, nearest, count_overlaps, and coverage, the first DataFrame is probed against the second. Swapping arguments changes which intervals appear in the left vs right output columns, and can affect performance.

  6. INT32 position limit: Genomic positions are stored as 32-bit integers, limiting coordinates to ~2.1 billion. This is sufficient for all known genomes but may be an issue with custom coordinate spaces.

  7. BAM index requirements: read_bam and scan_bam require a .bai index file alongside the BAM. Create one with samtools index if missing.

  8. Parallel execution disabled by default: DataFusion parallelism defaults to 1 partition. Enable for large datasets:

    pb.set_option("datafusion.execution.target_partitions", 8)
    
  9. CRAM has separate functions: Use read_cram/scan_cram/register_cram for CRAM files (not read_bam). CRAM functions require a reference_path parameter.
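For pitfall 6, a simple pre-flight check can catch out-of-range positions before they reach the library. The sketch below is a hypothetical helper written in plain Python, not a polars-bio function. It flags any interval whose start or end falls outside the non-negative signed 32-bit range.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def out_of_range_intervals(intervals):
    """Return the (chrom, start, end) tuples that do not fit in int32."""
    return [
        (chrom, start, end)
        for chrom, start, end in intervals
        if not (0 <= start <= INT32_MAX and 0 <= end <= INT32_MAX)
    ]


intervals = [
    ("chr1", 1, 248_956_422),           # fits comfortably
    ("concat", 3_000_000_000, 3_000_000_100),  # custom coordinate space, too large
]
bad = out_of_range_intervals(intervals)
```

A check like this is mainly useful for custom coordinate spaces, such as concatenated assemblies, where offsets can exceed the limit.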

Best Practices

  1. Use scan_* for large files: Prefer scan_bed, scan_vcf, etc. over read_* for files larger than available RAM. Scan functions enable streaming and predicate pushdown.

  2. Configure parallelism for large datasets:

    import os
    pb.set_option("datafusion.execution.target_partitions", os.cpu_count())
    
  3. Use BGZF compression: BGZF-compressed files (.bed.gz, .vcf.gz) support parallel block decompression, significantly faster than plain GZIP.

  4. Select columns early: When only specific columns are needed, select them early to reduce memory usage:

    df = pb.read_vcf("large.vcf.gz").select("chrom", "start", "end", "ref", "alt")
    
  5. Use cloud paths directly: Pass S3/GCS/Azure URIs directly to read/scan/register functions instead of downloading files first. Authenticated access reads your standard cloud SDK credentials (AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, GOOGLE_APPLICATION_CREDENTIALS, Azure defaults), and only when a cloud path is actually accessed. For public buckets, pass allow_anonymous=True:

    df = pb.read_bed("s3://my-bucket/regions.bed", allow_anonymous=True)
    
  6. Prefer functional API for single operations, method-chaining for pipelines: Use pb.overlap() for one-off operations and .lazy().pb.overlap() when building multi-step pipelines.

Resources

references/

Detailed documentation for each major capability:

  • interval_operations.md - All 8 interval operations with parameters, examples, output schemas, and performance tips. Core reference for genomic range arithmetic.

  • file_io.md - Supported formats table, per-format column schemas, cloud storage configuration, compression support, and common parameters.

  • sql_processing.md - Register functions, DataFusion SQL syntax, combining SQL with interval operations, and example queries.

  • pileup_operations.md - Per-base read depth computation from BAM/CRAM files, parameters, and integration with interval operations.

  • configuration.md - Global settings (parallelism, coordinate systems, streaming modes), logging, and metadata management.

  • bioframe_migration.md - Operation mapping table, API differences, performance comparison, migration code examples, and pandas compatibility mode.

How to use polars-bio on Cursor

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add polars-bio

2. Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills add https://github.com/K-Dense-AI/scientific-agent-skills --skill polars-bio

The skills CLI fetches polars-bio from the GitHub repository K-Dense-AI/scientific-agent-skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Windsurf

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/polars-bio

Reload or restart Cursor to activate polars-bio. Access the skill through slash commands (e.g., /polars-bio) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.


