dask

K-Dense-AI/scientific-agent-skills · updated Jun 4, 2026

MDX-style export adds YAML metadata + attribution linking explainx.ai and this canonical listing URL.

$npx skills add https://github.com/K-Dense-AI/scientific-agent-skills --skill dask
0 commentsdiscussion
summary

### Dask

  • name: "dask"
  • description: "Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed..."
  • allowed-tools: "Read Write Edit Bash"
skill.md
name
dask
description
Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, integration with existing pandas code. For out-of-core analytics on single machine use vaex; for in-memory speed use polars.
allowed-tools
Read Write Edit Bash
license
BSD-3-Clause license
compatibility
Requires Python 3.10+ and dask 2025.1+. DataFrame workflows need pandas 2+ and PyArrow 16+. Cloud paths (s3://, gcs://) need s3fs or gcsfs. Cluster deployment uses dask.distributed (included with dask[complete]).
metadata
version: "1.1" skill-author: K-Dense Inc.

Dask

Overview

Dask is a Python library for parallel and distributed computing that enables three critical capabilities:

  • Larger-than-memory execution on single machines for data exceeding available RAM
  • Parallel processing for improved computational speed across multiple cores
  • Distributed computation supporting terabyte-scale datasets across multiple machines

Dask scales from laptops (processing ~100 GiB) to clusters (processing ~100 TiB) while maintaining familiar Python APIs.

Current upstream: dask 2026.3.0 (PyPI, March 2026). Docs: docs.dask.org. Since 2025.1.0, the expression-based DataFrame API with query planning is the only implementation — do not install dask-expr separately or set dataframe.query-planning: False.

Quick Start

Installation

uv pip install "dask>=2025.1"

For a typical pandas/NumPy workflow with the distributed scheduler and dashboard:

uv pip install "dask[complete]"

Remote object storage (S3, GCS, Azure):

uv pip install s3fs    # s3:// paths
uv pip install gcsfs   # gs:// paths

Requires Python 3.10+ (3.9 support dropped in 2024.12). DataFrame I/O requires PyArrow 16+ (as of dask 2026.1.2).

When to Use This Skill

This skill should be used when:

  • Process datasets that exceed available RAM
  • Scale pandas or NumPy operations to larger datasets
  • Parallelize computations for performance improvements
  • Process multiple files efficiently (CSVs, Parquet, JSON, text logs)
  • Build custom parallel workflows with task dependencies
  • Distribute workloads across multiple cores or machines

Core Capabilities

Dask provides five main components, each suited to different use cases:

1. DataFrames - Parallel Pandas Operations

Purpose: Scale pandas operations to larger datasets through parallel processing.

When to Use:

  • Tabular data exceeds available RAM
  • Need to process multiple CSV/Parquet files together
  • Pandas operations are slow and need parallelization
  • Scaling from pandas prototype to production

Reference Documentation: For comprehensive guidance on Dask DataFrames, refer to references/dataframes.md which includes:

  • Reading data (single files, multiple files, glob patterns)
  • Common operations (filtering, groupby, joins, aggregations)
  • Custom operations with map_partitions
  • Performance optimization tips
  • Common patterns (ETL, time series, multi-file processing)

Quick Example:

import dask.dataframe as dd

# Read multiple files as single DataFrame
ddf = dd.read_csv('data/2024-*.csv')

# Operations are lazy until compute()
filtered = ddf[ddf['value'] > 100]
result = filtered.groupby('category').mean().compute()

Key Points:

  • Operations are lazy (build task graph) until .compute() called
  • Use map_partitions for efficient custom operations
  • Convert to DataFrame early when working with structured data from other sources

2. Arrays - Parallel NumPy Operations

Purpose: Extend NumPy capabilities to datasets larger than memory using blocked algorithms.

When to Use:

  • Arrays exceed available RAM
  • NumPy operations need parallelization
  • Working with scientific datasets (HDF5, Zarr, NetCDF)
  • Need parallel linear algebra or array operations

Reference Documentation: For comprehensive guidance on Dask Arrays, refer to references/arrays.md which includes:

  • Creating arrays (from NumPy, random, from disk)
  • Chunking strategies and optimization
  • Common operations (arithmetic, reductions, linear algebra)
  • Custom operations with map_blocks
  • Integration with HDF5, Zarr, and XArray

Quick Example:

import dask.array as da

# Create large array with chunks
x = da.random.random((100000, 100000), chunks=(10000, 10000))

# Operations are lazy
y = x + 100
z = y.mean(axis=0)

# Compute result
result = z.compute()

Key Points:

  • Chunk size is critical (aim for ~100 MB per chunk)
  • Operations work on chunks in parallel
  • Rechunk data when needed for efficient operations
  • Use map_blocks for operations not available in Dask

3. Bags - Parallel Processing of Unstructured Data

Purpose: Process unstructured or semi-structured data (text, JSON, logs) with functional operations.

When to Use:

  • Processing text files, logs, or JSON records
  • Data cleaning and ETL before structured analysis
  • Working with Python objects that don't fit array/dataframe formats
  • Need memory-efficient streaming processing

Reference Documentation: For comprehensive guidance on Dask Bags, refer to references/bags.md which includes:

  • Reading text and JSON files
  • Functional operations (map, filter, fold, groupby)
  • Converting to DataFrames
  • Common patterns (log analysis, JSON processing, text processing)
  • Performance considerations

Quick Example:

import dask.bag as db
import json

# Read and parse JSON files
bag = db.read_text('logs/*.json').map(json.loads)

# Filter and transform
valid = bag.filter(lambda x: x['status'] == 'valid')
processed = valid.map(lambda x: {'id': x['id'], 'value': x['value']})

# Convert to DataFrame for analysis
ddf = processed.to_dataframe()

Key Points:

  • Use for initial data cleaning, then convert to DataFrame/Array
  • Use foldby instead of groupby for better performance
  • Operations are streaming and memory-efficient
  • Convert to structured formats (DataFrame) for complex operations

4. Futures - Task-Based Parallelization

Purpose: Build custom parallel workflows with fine-grained control over task execution and dependencies.

When to Use:

  • Building dynamic, evolving workflows
  • Need immediate task execution (not lazy)
  • Computations depend on runtime conditions
  • Implementing custom parallel algorithms
  • Need stateful computations

Reference Documentation: For comprehensive guidance on Dask Futures, refer to references/futures.md which includes:

  • Setting up distributed client
  • Submitting tasks and working with futures
  • Task dependencies and data movement
  • Advanced coordination (queues, locks, events, actors)
  • Common patterns (parameter sweeps, dynamic tasks, iterative algorithms)

Quick Example:

from dask.distributed import Client

client = Client()  # Create local cluster

# Submit tasks (executes immediately)
def process(x):
    return x ** 2

futures = client.map(process, range(100))

# Gather results
results = client.gather(futures)

client.close()

Key Points:

  • Requires distributed client (even for single machine)
  • Tasks execute immediately when submitted
  • Pre-scatter large data to avoid repeated transfers
  • ~1ms overhead per task (not suitable for millions of tiny tasks)
  • Use actors for stateful workflows

5. Schedulers - Execution Backends

Purpose: Control how and where Dask tasks execute (threads, processes, distributed).

When to Choose Scheduler:

  • Threads (default): NumPy/Pandas operations, GIL-releasing libraries, shared memory benefit
  • Processes: Pure Python code, text processing, GIL-bound operations
  • Synchronous: Debugging with pdb, profiling, understanding errors
  • Distributed: Need dashboard, multi-machine clusters, advanced features

Reference Documentation: For comprehensive guidance on Dask Schedulers, refer to references/schedulers.md which includes:

  • Detailed scheduler descriptions and characteristics
  • Configuration methods (global, context manager, per-compute)
  • Performance considerations and overhead
  • Common patterns and troubleshooting
  • Thread configuration for optimal performance

Quick Example:

import dask
import dask.dataframe as dd

# Use threads for DataFrame (default, good for numeric)
ddf = dd.read_csv('data.csv')
result1 = ddf.mean().compute()  # Uses threads

# Use processes for Python-heavy work
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(python_function).compute(scheduler='processes')

# Use synchronous for debugging
dask.config.set(scheduler='synchronous')
result3 = problematic_computation.compute()  # Can use pdb

# Use distributed for monitoring and scaling
from dask.distributed import Client
client = Client()
result4 = computation.compute()  # Uses distributed with dashboard

Key Points:

  • Threads: Lowest overhead (~10 µs/task), best for numeric work
  • Processes: Avoids GIL (~10 ms/task), best for Python work
  • Distributed: Monitoring dashboard (~1 ms/task), scales to clusters
  • Can switch schedulers per computation or globally

Best Practices

For comprehensive performance optimization guidance, memory management strategies, and common pitfalls to avoid, refer to references/best-practices.md. Key principles include:

Start with Simpler Solutions

Before using Dask, explore:

  • Better algorithms
  • Efficient file formats (Parquet instead of CSV)
  • Compiled code (Numba, Cython)
  • Data sampling

Critical Performance Rules

1. Don't Load Data Locally Then Hand to Dask

# Wrong: Loads all data in memory first
import pandas as pd
df = pd.read_csv('large.csv')
ddf = dd.from_pandas(df, npartitions=10)

# Correct: Let Dask handle loading
import dask.dataframe as dd
ddf = dd.read_csv('large.csv')

2. Avoid Repeated compute() Calls

# Wrong: Each compute is separate
for item in items:
    result = dask_computation(item).compute()

# Correct: Single compute for all
computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)

3. Don't Build Excessively Large Task Graphs

  • Increase chunk sizes if millions of tasks
  • Use map_partitions/map_blocks to fuse operations
  • Check task graph size: len(ddf.__dask_graph__())

4. Choose Appropriate Chunk Sizes

  • Target: ~100 MB per chunk (or 10 chunks per core in worker memory)
  • Too large: Memory overflow
  • Too small: Scheduling overhead

5. Use the Dashboard

from dask.distributed import Client
client = Client()
print(client.dashboard_link)  # Monitor performance, identify bottlenecks

Common Workflow Patterns

ETL Pipeline

import dask.dataframe as dd

# Extract: Read data
ddf = dd.read_csv('raw_data/*.csv')

# Transform: Clean and process
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].astype('float64')
ddf = ddf.dropna(subset=['important_col'])

# Load: Aggregate and save
summary = ddf.groupby('category').agg({'amount': ['sum', 'mean']})
summary.to_parquet('output/summary.parquet')

Unstructured to Structured Pipeline

import dask.bag as db
import json

# Start with Bag for unstructured data
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')

# Convert to DataFrame for structured analysis
ddf = bag.to_dataframe()
result = ddf.groupby('category').mean().compute()

Large-Scale Array Computation

import dask.array as da

# Load or create large array
x = da.from_zarr('large_dataset.zarr')

# Process in chunks
normalized = (x - x.mean()) / x.std()

# Save result (use mode= for overwrite; zarr_array_kwargs for compression)
da.to_zarr(normalized, 'normalized.zarr', mode='w')

Custom Parallel Workflow

from dask.distributed import Client

client = Client()

# Scatter large dataset once
data = client.scatter(large_dataset)

# Process in parallel with dependencies
futures = []
for param in parameters:
    future = client.submit(process, data, param)
    futures.append(future)

# Gather results
results = client.gather(futures)

Selecting the Right Component

Use this decision guide to choose the appropriate Dask component:

Data Type:

  • Tabular data → DataFrames
  • Numeric arrays → Arrays
  • Text/JSON/logs → Bags (then convert to DataFrame)
  • Custom Python objects → Bags or Futures

Operation Type:

  • Standard pandas operations → DataFrames
  • Standard NumPy operations → Arrays
  • Custom parallel tasks → Futures
  • Text processing/ETL → Bags

Control Level:

  • High-level, automatic → DataFrames/Arrays
  • Low-level, manual → Futures

Workflow Type:

  • Static computation graph → DataFrames/Arrays/Bags
  • Dynamic, evolving → Futures

Integration Considerations

File Formats

  • Efficient: Parquet, HDF5, Zarr (columnar, compressed, parallel-friendly)
  • Compatible but slower: CSV (use for initial ingestion only)
  • For Arrays: HDF5, Zarr, NetCDF

Conversion Between Collections

# Bag → DataFrame
ddf = bag.to_dataframe()

# DataFrame → Array (for numeric data)
arr = ddf.to_dask_array(lengths=True)

# Array → DataFrame
ddf = dd.from_dask_array(arr, columns=['col1', 'col2'])

With Other Libraries

  • XArray: Wraps Dask arrays with labeled dimensions (geospatial, imaging)
  • Dask-ML: Machine learning with scikit-learn compatible APIs
  • Distributed: Advanced cluster management and monitoring

Debugging and Development

Iterative Development Workflow

  1. Test on small data with synchronous scheduler:
dask.config.set(scheduler='synchronous')
result = computation.compute()  # Can use pdb, easy debugging
  1. Validate with threads on sample:
sample = ddf.head(1000)  # Small sample
# Test logic, then scale to full dataset
  1. Scale with distributed for monitoring:
from dask.distributed import Client
client = Client()
print(client.dashboard_link)  # Monitor performance
result = computation.compute()

Common Issues

Memory Errors:

  • Decrease chunk sizes
  • Use persist() strategically and delete when done
  • Check for memory leaks in custom functions

Slow Start:

  • Task graph too large (increase chunk sizes)
  • Use map_partitions or map_blocks to reduce tasks

Poor Parallelization:

  • Chunks too large (increase number of partitions)
  • Using threads with Python code (switch to processes)
  • Data dependencies preventing parallelism

Reference Files

All reference documentation files can be read as needed for detailed information:

  • references/dataframes.md - Complete Dask DataFrame guide
  • references/arrays.md - Complete Dask Array guide
  • references/bags.md - Complete Dask Bag guide
  • references/futures.md - Complete Dask Futures and distributed computing guide
  • references/schedulers.md - Complete scheduler selection and configuration guide
  • references/best-practices.md - Comprehensive performance optimization and troubleshooting

Load these files when users need detailed information about specific Dask components, operations, or patterns beyond the quick guidance provided here.

how to use dask

How to use dask on Cursor

AI-first code editor with Composer

1

Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add dask
2

Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$npx skills add https://github.com/K-Dense-AI/scientific-agent-skills --skill dask

The skills CLI fetches dask from GitHub repository K-Dense-AI/scientific-agent-skills and configures it for Cursor.

3

Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ●Cursor(selected)
│ • Cursor
│ • Windsurf
4

Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/dask

Reload or restart Cursor to activate dask. Access the skill through slash commands (e.g., /dask) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.

List & Monetize Your Skill

Submit your Claude Code skill and start earning

GET_STARTED →

Use Cases

Task Automation & Efficiency

Automate repetitive workflows and reduce manual effort

Example

Generate reports, summarize documents, draft communications

Save 3-5 hours per week on routine tasks

Knowledge Enhancement

Learn new skills, understand complex topics, get expert guidance

Example

Explain concepts, provide examples, suggest learning resources

Accelerate learning and skill development by 2x

Quality Improvement

Enhance output quality through reviews, suggestions, and refinements

Example

Review drafts, suggest improvements, catch errors

Improve work quality by 30-40% with less effort

Implementation Guide

Prerequisites

  • Claude Desktop or compatible AI client with skill support
  • Clear understanding of task or problem to solve
  • Willingness to iterate and refine outputs

Time Estimate

15-45 minutes depending on use case complexity

Installation Steps

  1. 1.Install skill using provided installation command
  2. 2.Test with simple use case relevant to your work
  3. 3.Evaluate output quality and relevance
  4. 4.Iterate on prompts to improve results
  5. 5.Integrate into regular workflow if valuable

Common Pitfalls

  • Expecting perfect results without iteration
  • Not providing enough context in prompts
  • Using skill for tasks outside its intended scope
  • Accepting outputs without review and validation

Best Practices

✓ Do

  • +Start with clear, specific prompts
  • +Provide relevant context and constraints
  • +Review and refine all outputs before using
  • +Iterate to improve output quality
  • +Document successful prompt patterns

✗ Don't

  • Don't use without understanding skill limitations
  • Don't skip validation of outputs
  • Don't share sensitive information in prompts
  • Don't expect skill to replace human judgment

💡 Pro Tips

  • Be specific about desired format and style
  • Ask for multiple options to choose from
  • Request explanations to understand reasoning
  • Combine AI efficiency with human expertise

When to Use This

✓ Use When

Use when skill capabilities match your task, clear ROI on time saved, and you can validate outputs. Best for repetitive tasks, learning, and quality improvement.

✗ Avoid When

Avoid when task requires deep expertise you can't validate, involves sensitive decisions, or when learning process is more valuable than speed of completion.

Learning Path

  1. 1Familiarize yourself with skill capabilities and limitations
  2. 2Start with low-risk, non-critical tasks
  3. 3Progress to more complex and valuable use cases
  4. 4Build expertise through regular use and experimentation

Discussion

Product Hunt–style comments (not star reviews)
  • No comments yet — start the thread.
general reviews

Ratings

4.726 reviews
  • Kwame Martinez· Dec 28, 2024

    Registry listing for dask matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Ama Harris· Dec 8, 2024

    Solid pick for teams standardizing on skills: dask is focused, and the summary matches what you get after install.

  • Sakshi Patil· Nov 23, 2024

    Solid pick for teams standardizing on skills: dask is focused, and the summary matches what you get after install.

  • Meera Srinivasan· Nov 19, 2024

    dask reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Chaitanya Patil· Oct 14, 2024

    I recommend dask for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Naina Malhotra· Oct 10, 2024

    dask is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Mateo Desai· Sep 25, 2024

    We added dask from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Piyush G· Sep 17, 2024

    dask fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Hana Kapoor· Aug 16, 2024

    Keeps context tight: dask is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Shikha Mishra· Aug 8, 2024

    dask has been reliable in day-to-day use. Documentation quality is above average for community skills.

showing 1-10 of 26

1 / 3