scikit-learn

K-Dense-AI/scientific-agent-skills · updated Jun 4, 2026

$ npx skills add https://github.com/K-Dense-AI/scientific-agent-skills --skill scikit-learn

skill.md

  • name: scikit-learn
  • description: Machine learning in Python with scikit-learn. Use when working with supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), model evaluation, hyperparameter tuning, preprocessing, or building ML pipelines. Provides comprehensive reference documentation for algorithms, preprocessing techniques, pipelines, and best practices.
  • license: BSD-3-Clause license
  • allowed-tools: Read Write Edit Bash
  • compatibility: Requires Python 3.11+ and scikit-learn 1.7+. NumPy and SciPy are required dependencies. Optional matplotlib/seaborn for bundled example scripts that save plots.
  • metadata: version "1.1", skill-author K-Dense Inc.

Scikit-learn

Overview

This skill provides comprehensive guidance for machine learning tasks using scikit-learn, the industry-standard Python library for classical machine learning. Use this skill for classification, regression, clustering, dimensionality reduction, preprocessing, model evaluation, and building production-ready ML pipelines.

Installation

Tested against scikit-learn 1.8.0 (stable; December 2025). Requires Python 3.11–3.14 (free-threaded CPython 3.14 wheels available in 1.8+).

Install the PyPI package scikit-learn (not the deprecated sklearn package on PyPI). Import in code as sklearn.

# Install scikit-learn using uv
uv pip install "scikit-learn>=1.7"

# Optional: plotting libraries used by the bundled example scripts
# (scikit-learn does not define a "plots" extra, so install them directly)
uv pip install matplotlib seaborn

# Commonly used with
uv pip install pandas numpy

Check your version:

import sklearn
print(sklearn.__version__)

When to Use This Skill

Use the scikit-learn skill when:

  • Building classification or regression models
  • Performing clustering or dimensionality reduction
  • Preprocessing and transforming data for machine learning
  • Evaluating model performance with cross-validation
  • Tuning hyperparameters with grid or random search
  • Creating ML pipelines for production workflows
  • Comparing different algorithms for a task
  • Working with both structured (tabular) and text data
  • Need interpretable, classical machine learning approaches

Quick Start

Classification Example

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Split data (X and y are assumed to be an already-loaded feature matrix and label vector)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

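# Preprocess: fit the scaler on the training split only
# (random forests do not require scaling; it is shown here to illustrate the pattern)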
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

Complete Pipeline with Mixed Data

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Define feature types
numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']

# Create preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

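# Fit and predict (X_train and X_test are assumed to be pandas DataFrames
# containing the columns named in numeric_features and categorical_features)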
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Core Capabilities

1. Supervised Learning

Comprehensive algorithms for classification and regression tasks.

Key algorithms:

  • Linear models: Logistic Regression, Linear Regression, Ridge, Lasso, ElasticNet
  • Tree-based: Decision Trees, Random Forest, Gradient Boosting
  • Support Vector Machines: SVC, SVR with various kernels
  • Ensemble methods: AdaBoost, Voting, Stacking
  • Neural Networks: MLPClassifier, MLPRegressor
  • Others: Naive Bayes, K-Nearest Neighbors

When to use:

  • Classification: Predicting discrete categories (spam detection, image classification, fraud detection)
  • Regression: Predicting continuous values (price prediction, demand forecasting)

See: references/supervised_learning.md for detailed algorithm documentation, parameters, and usage examples.
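
As a rough illustration of the two task types above, the sketch below fits one classifier and one regressor. The synthetic datasets from make_classification and make_regression, and all variable names, are assumptions chosen for the example and stand in for real data.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration; replace with your own X, y)
X_clf, y_clf = make_classification(n_samples=500, n_features=10, random_state=0)
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Classification: predict a discrete label, report accuracy on held-out data
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, stratify=y_clf, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(Xc_train, yc_train)
clf_accuracy = clf.score(Xc_test, yc_test)

# Regression: predict a continuous value, report R^2 on held-out data
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)
reg = Ridge(alpha=1.0).fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)

print(f"classification accuracy: {clf_accuracy:.3f}")
print(f"regression R^2: {reg_r2:.3f}")
```

For classifiers, score returns accuracy; for regressors it returns R². Both estimators share the same fit/predict/score interface, which is what makes swapping algorithms straightforward.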

2. Unsupervised Learning

Discover patterns in unlabeled data through clustering and dimensionality reduction.

Clustering algorithms:

  • Partition-based: K-Means, MiniBatchKMeans
  • Density-based: DBSCAN, HDBSCAN, OPTICS
  • Hierarchical: AgglomerativeClustering
  • Probabilistic: Gaussian Mixture Models
  • Others: MeanShift, SpectralClustering, BIRCH

Dimensionality reduction:

  • Linear: PCA, TruncatedSVD, NMF
  • Manifold learning: t-SNE, Isomap, LLE, MDS, ClassicalMDS (1.8+)
  • External (install separately): UMAP (umap-learn)
  • Feature extraction: FastICA, LatentDirichletAllocation

When to use:

  • Customer segmentation, anomaly detection, data visualization
  • Reducing feature dimensions, exploratory data analysis
  • Topic modeling, image compression

See: references/unsupervised_learning.md for detailed documentation.
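
A minimal sketch of one dimensionality-reduction and one density-based clustering call from the lists above. The datasets (the bundled digits data and synthetic blobs) and the DBSCAN settings eps=0.3 and min_samples=5 are illustrative assumptions, not recommended defaults.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits, make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Dimensionality reduction: a float n_components keeps just enough
# principal components to explain that fraction of the variance
X_digits, _ = load_digits(return_X_y=True)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_digits)
print(f"{X_digits.shape[1]} features reduced to {X_reduced.shape[1]}")

# Density-based clustering: DBSCAN needs no cluster count up front
# and labels points it considers noise as -1
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
X_blobs_scaled = StandardScaler().fit_transform(X_blobs)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_blobs_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {(labels == -1).sum()}")
```

The number of clusters DBSCAN finds depends strongly on eps and on feature scale, so these values usually need tuning per dataset.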

3. Model Evaluation and Selection

Tools for robust model evaluation, cross-validation, and hyperparameter tuning.

Cross-validation strategies:

  • KFold, StratifiedKFold (classification)
  • TimeSeriesSplit (temporal data)
  • GroupKFold (grouped samples)

Hyperparameter tuning:

  • GridSearchCV (exhaustive search)
  • RandomizedSearchCV (random sampling)
  • HalvingGridSearchCV (successive halving; experimental, enable with from sklearn.experimental import enable_halving_search_cv)

Metrics:

  • Classification: accuracy, precision, recall, F1-score, ROC AUC, confusion matrix
  • Regression: MSE, RMSE, MAE, R², MAPE
  • Clustering: silhouette score, Calinski-Harabasz, Davies-Bouldin

When to use:

  • Comparing model performance objectively
  • Finding optimal hyperparameters
  • Preventing overfitting through cross-validation
  • Understanding model behavior with learning curves

See: references/model_evaluation.md for comprehensive metrics and tuning strategies.
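
The sketch below combines a cross-validation strategy, several metrics, and a randomized search. The bundled breast cancer dataset, the search range for C, and n_iter=10 are assumptions made for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Evaluate several metrics in one pass
results = cross_validate(pipe, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
print(f"accuracy: {results['test_accuracy'].mean():.3f}")
print(f"ROC AUC:  {results['test_roc_auc'].mean():.3f}")

# Randomized search samples C from a log-uniform distribution;
# make_pipeline names each step after its lowercased class name
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=cv,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so the reported scores are not inflated by leakage.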

4. Data Preprocessing

Transform raw data into formats suitable for machine learning.

Scaling and normalization:

  • StandardScaler (zero mean, unit variance)
  • MinMaxScaler (bounded range)
  • RobustScaler (robust to outliers)
  • Normalizer (sample-wise normalization)

Encoding categorical variables:

  • OneHotEncoder (nominal categories)
  • OrdinalEncoder (ordered categories)
  • LabelEncoder (encodes target labels y as integers; not intended for input features)

Handling missing values:

  • SimpleImputer (mean, median, most frequent)
  • KNNImputer (k-nearest neighbors)
  • IterativeImputer (multivariate imputation; experimental, enable with from sklearn.experimental import enable_iterative_imputer)

Feature engineering:

  • PolynomialFeatures (interaction terms)
  • KBinsDiscretizer (binning)
  • Feature selection (RFE, SelectKBest, SelectFromModel)

When to use:

  • Before training any algorithm that requires scaled features (SVM, KNN, Neural Networks)
  • Converting categorical variables to numeric format
  • Handling missing data systematically
  • Creating non-linear features for linear models

See: references/preprocessing.md for detailed preprocessing techniques.
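
To make the encoder and imputer distinctions concrete, here is a small sketch on toy arrays. All of the data values are made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# OrdinalEncoder is for input features; pass the category order explicitly
# when it is meaningful, otherwise categories are sorted alphabetically
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = ordinal.fit_transform(sizes)
print(sizes_encoded.ravel())  # small=0, medium=1, large=2

# LabelEncoder is for the target vector y; classes are sorted alphabetically
y_raw = ["cat", "dog", "cat", "bird"]
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_raw)
print(y_encoded, label_encoder.classes_)

# KNNImputer fills a missing value with the mean of that feature
# over the nearest rows that have it
X_missing = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X_missing)
print(X_imputed)
```

With only two other rows available, both act as neighbors, so the missing entry becomes the mean of 1.0 and 7.0.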

5. Pipelines and Composition

Build reproducible, production-ready ML workflows.

Key components:

  • Pipeline: Chain transformers and estimators sequentially
  • ColumnTransformer: Apply different preprocessing to different columns
  • FeatureUnion: Combine multiple transformers in parallel
  • TransformedTargetRegressor: Transform target variable

Benefits:

  • Prevents data leakage in cross-validation
  • Simplifies code and improves maintainability
  • Enables joint hyperparameter tuning
  • Ensures consistency between training and prediction

When to use:

  • Always use Pipelines for production workflows
  • When mixing numerical and categorical features (use ColumnTransformer)
  • When performing cross-validation with preprocessing steps
  • When hyperparameter tuning includes preprocessing parameters

See: references/pipelines_and_composition.md for comprehensive pipeline patterns.
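
Of the components above, TransformedTargetRegressor is the one not shown elsewhere in this document, so here is a minimal sketch. The synthetic, strictly positive target and the choice of a log transform are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Build a skewed, strictly positive target from a linear signal (illustrative only)
X, y_linear = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
y = np.exp(y_linear / 100.0)

# Feature preprocessing and the regressor are chained in a Pipeline
regressor = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])

# The target is log-transformed before fitting, and predictions are
# mapped back to the original scale with the inverse function
model = TransformedTargetRegressor(
    regressor=regressor, func=np.log, inverse_func=np.exp
)
model.fit(X, y)
preds = model.predict(X)
print(preds[:3])
```

Because predictions pass through np.exp, they are always positive, which a plain linear model on the raw target would not guarantee.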

Example Scripts

Classification Pipeline

Run a complete classification workflow with preprocessing, model comparison, hyperparameter tuning, and evaluation:

uv run python scripts/classification_pipeline.py

This script demonstrates:

  • Handling mixed data types (numeric and categorical)
  • Model comparison using cross-validation
  • Hyperparameter tuning with GridSearchCV
  • Comprehensive evaluation with multiple metrics
  • Feature importance analysis

Clustering Analysis

Perform clustering analysis with algorithm comparison and visualization:

uv run python scripts/clustering_analysis.py

This script demonstrates:

  • Finding optimal number of clusters (elbow method, silhouette analysis)
  • Comparing multiple clustering algorithms (K-Means, DBSCAN, Agglomerative, Gaussian Mixture)
  • Evaluating clustering quality without ground truth
  • Visualizing results with PCA projection

Reference Documentation

This skill includes comprehensive reference files for deep dives into specific topics:

Quick Reference

File: references/quick_reference.md

  • Common import patterns and installation instructions
  • Quick workflow templates for common tasks
  • Algorithm selection cheat sheets
  • Common patterns and gotchas
  • Performance optimization tips

Supervised Learning

File: references/supervised_learning.md

  • Linear models (regression and classification)
  • Support Vector Machines
  • Decision Trees and ensemble methods
  • K-Nearest Neighbors, Naive Bayes, Neural Networks
  • Algorithm selection guide

Unsupervised Learning

File: references/unsupervised_learning.md

  • All clustering algorithms with parameters and use cases
  • Dimensionality reduction techniques
  • Outlier and novelty detection
  • Gaussian Mixture Models
  • Method selection guide

Model Evaluation

File: references/model_evaluation.md

  • Cross-validation strategies
  • Hyperparameter tuning methods
  • Classification, regression, and clustering metrics
  • Learning and validation curves
  • Best practices for model selection

Preprocessing

File: references/preprocessing.md

  • Feature scaling and normalization
  • Encoding categorical variables
  • Missing value imputation
  • Feature engineering techniques
  • Custom transformers

Pipelines and Composition

File: references/pipelines_and_composition.md

  • Pipeline construction and usage
  • ColumnTransformer for mixed data types
  • FeatureUnion for parallel transformations
  • Complete end-to-end examples
  • Best practices

Common Workflows

Building a Classification Model

  1. Load and explore data

    import pandas as pd
    df = pd.read_csv('data.csv')
    X = df.drop('target', axis=1)
    y = df['target']
    
  2. Split data with stratification

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    
  3. Create preprocessing pipeline

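    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.compose import ColumnTransformer
    
    # Infer column groups from dtypes (adjust these lists to your data)
    numeric_features = X.select_dtypes(include='number').columns.tolist()
    categorical_features = X.select_dtypes(exclude='number').columns.tolist()
    
    # Handle numeric and categorical features separately
    preprocessor = ColumnTransformer([
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])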
    
  4. Build complete pipeline

    from sklearn.ensemble import RandomForestClassifier
    
    model = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])
    
  5. Tune hyperparameters

    from sklearn.model_selection import GridSearchCV
    
    param_grid = {
        'classifier__n_estimators': [100, 200],
        'classifier__max_depth': [10, 20, None]
    }
    
    grid_search = GridSearchCV(model, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    
  6. Evaluate on test set

    from sklearn.metrics import classification_report
    
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    print(classification_report(y_test, y_pred))
    

Performing Clustering Analysis

  1. Preprocess data

    from sklearn.preprocessing import StandardScaler
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
  2. Find optimal number of clusters

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    
    # Score each candidate k by its mean silhouette coefficient (higher is better)
    k_values = list(range(2, 11))
    scores = []
    for k in k_values:
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(X_scaled)
        scores.append(silhouette_score(X_scaled, labels))
    
    optimal_k = k_values[int(np.argmax(scores))]
    
  3. Apply clustering

    model = KMeans(n_clusters=optimal_k, random_state=42)
    labels = model.fit_predict(X_scaled)
    
  4. Visualize with dimensionality reduction

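    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    
    # Project to 2D for plotting only; clustering used the full feature space
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_scaled)
    
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis')
    plt.show()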
    

Best Practices

Always Use Pipelines

Pipelines prevent data leakage and ensure consistency:

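from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Good: preprocessing lives inside the pipeline, so it is refit on each training fold
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Bad: scaling the full dataset before splitting or cross-validating
# lets test-set statistics leak into training
X_scaled = StandardScaler().fit_transform(X)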

Fit on Training Data Only

Never fit on test data:

# Good
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Only transform

# Bad: fitting on train and test together leaks test-set statistics
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))

Use Stratified Splitting for Classification

Preserve class distribution:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
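
A quick check of what stratify=y does, on a made-up 90/10 imbalanced label vector (the data here is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep the original 10% positive rate
print(y_train.mean(), y_test.mean())
```

Without stratify, a 20-sample test set drawn from this data could easily contain zero positives, which makes metrics such as recall undefined or misleading.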

Set Random State for Reproducibility

model = RandomForestClassifier(n_estimators=100, random_state=42)

Choose Appropriate Metrics

  • Balanced data: Accuracy, F1-score
  • Imbalanced data: Precision, Recall, ROC AUC, Balanced Accuracy
  • Cost-sensitive: Define a custom scorer (for example with sklearn.metrics.make_scorer)
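
One way to define a custom scorer for the cost-sensitive case is sketched below. The function misclassification_cost and its cost values (a false negative costing five times a false positive) are hypothetical and chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score


def misclassification_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Hypothetical cost: weight false negatives more heavily than false positives."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn_cost * fn + fp_cost * fp


# greater_is_better=False tells scikit-learn this is a loss, so the scorer
# returns the negated cost and higher (closer to zero) is still better
cost_scorer = make_scorer(misclassification_cost, greater_is_better=False)

# Imbalanced synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring=cost_scorer
)
print(scores)
```

The same cost_scorer object can be passed as scoring to GridSearchCV or RandomizedSearchCV so that tuning optimizes the business cost directly.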

Scale Features When Required

Algorithms requiring feature scaling:

  • SVM, KNN, Neural Networks
  • PCA, Linear/Logistic Regression with regularization
  • K-Means clustering

Algorithms not requiring scaling:

  • Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
  • Naive Bayes

Troubleshooting Common Issues

ConvergenceWarning

Issue: The model did not converge.

Solution: Increase max_iter or scale the features.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

Poor Performance on Test Set

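Issue: The model overfits the training data.

Solution: Use regularization, cross-validation, or a simpler model.

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Add regularization
model = Ridge(alpha=1.0)

# Use cross-validation
scores = cross_val_score(model, X, y, cv=5)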

Memory Error with Large Datasets

Solution: Use algorithms designed for large data

# Use SGD for large datasets
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()

# Or MiniBatchKMeans for clustering
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=8, batch_size=100)
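
SGDClassifier also supports incremental training through partial_fit, which is what keeps memory bounded. In the sketch below, slices of an in-memory synthetic array stand in for batches that would normally be read from disk. The batch size of 500 is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to load at once
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# partial_fit needs the full set of class labels on the first call,
# because a single batch may not contain every class
classes = np.unique(y)
model = SGDClassifier(random_state=0)

batch_size = 500
for start in range(0, len(X), batch_size):
    batch = slice(start, start + batch_size)
    model.partial_fit(X[batch], y[batch], classes=classes)

accuracy = model.score(X, y)
print(f"accuracy after one pass: {accuracy:.3f}")
```

SGD-based models are sensitive to feature scale, so in practice a scaler fitted incrementally (StandardScaler also offers partial_fit) is usually applied to each batch first.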

Additional Resources

How to use scikit-learn on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add scikit-learn

2. Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills add https://github.com/K-Dense-AI/scientific-agent-skills --skill scikit-learn

The skills CLI fetches scikit-learn from GitHub repository K-Dense-AI/scientific-agent-skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Windsurf

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/scikit-learn

Reload or restart Cursor to activate scikit-learn. Access the skill through slash commands (e.g., /scikit-learn) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.

Use Cases

Task Automation & Efficiency

Automate repetitive workflows and reduce manual effort

Example

Generate reports, summarize documents, draft communications

Save 3-5 hours per week on routine tasks

Knowledge Enhancement

Learn new skills, understand complex topics, get expert guidance

Example

Explain concepts, provide examples, suggest learning resources

Accelerate learning and skill development by 2x

Quality Improvement

Enhance output quality through reviews, suggestions, and refinements

Example

Review drafts, suggest improvements, catch errors

Improve work quality by 30-40% with less effort

Implementation Guide

Prerequisites

  • Claude Desktop or compatible AI client with skill support
  • Clear understanding of task or problem to solve
  • Willingness to iterate and refine outputs

Time Estimate

15-45 minutes depending on use case complexity

Installation Steps

  1. Install the skill using the provided installation command
  2. Test with a simple use case relevant to your work
  3. Evaluate output quality and relevance
  4. Iterate on prompts to improve results
  5. Integrate into your regular workflow if valuable

Common Pitfalls

  • Expecting perfect results without iteration
  • Not providing enough context in prompts
  • Using skill for tasks outside its intended scope
  • Accepting outputs without review and validation

Best Practices

✓ Do

  • Start with clear, specific prompts
  • Provide relevant context and constraints
  • Review and refine all outputs before using
  • Iterate to improve output quality
  • Document successful prompt patterns

✗ Don't

  • Don't use without understanding skill limitations
  • Don't skip validation of outputs
  • Don't share sensitive information in prompts
  • Don't expect skill to replace human judgment

💡 Pro Tips

  • Be specific about desired format and style
  • Ask for multiple options to choose from
  • Request explanations to understand reasoning
  • Combine AI efficiency with human expertise

When to Use This

✓ Use When

Use when the skill's capabilities match your task, the time saved gives a clear return, and you can validate the outputs. Best for repetitive tasks, learning, and quality improvement.

✗ Avoid When

Avoid when the task requires deep expertise you can't validate, involves sensitive decisions, or when the learning process is more valuable than speed of completion.

Learning Path

  1. Familiarize yourself with skill capabilities and limitations
  2. Start with low-risk, non-critical tasks
  3. Progress to more complex and valuable use cases
  4. Build expertise through regular use and experimentation
