scikit-learn
K-Dense-AI/scientific-agent-skills · updated Jun 4, 2026
### Scikit Learn
| name | scikit-learn |
| description | Machine learning in Python with scikit-learn. Use when working with supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), model evaluation, hyperparameter tuning, preprocessing, or building ML pipelines. Provides comprehensive reference documentation for algorithms, preprocessing techniques, pipelines, and best practices. |
| license | BSD-3-Clause license |
| allowed-tools | Read Write Edit Bash |
| compatibility | Requires Python 3.11+ and scikit-learn 1.7+. NumPy and SciPy are required dependencies. Optional matplotlib/seaborn for bundled example scripts that save plots. |
| metadata | version: "1.1" skill-author: K-Dense Inc. |
Scikit-learn
Overview
This skill provides comprehensive guidance for machine learning tasks using scikit-learn, the industry-standard Python library for classical machine learning. Use this skill for classification, regression, clustering, dimensionality reduction, preprocessing, model evaluation, and building production-ready ML pipelines.
Installation
Tested against scikit-learn 1.8.0 (stable; December 2025). Requires Python 3.11–3.14 (free-threaded CPython 3.14 wheels available in 1.8+).
Install the PyPI package scikit-learn (not the deprecated sklearn package on PyPI). Import in code as sklearn.
# Install scikit-learn using uv
uv pip install "scikit-learn>=1.7"
# Optional: plotting dependencies for the bundled example scripts
uv pip install scikit-learn matplotlib seaborn
# Commonly used with
uv pip install pandas numpy
Check your version:
import sklearn
print(sklearn.__version__)
When to Use This Skill
Use the scikit-learn skill when:
- Building classification or regression models
- Performing clustering or dimensionality reduction
- Preprocessing and transforming data for machine learning
- Evaluating model performance with cross-validation
- Tuning hyperparameters with grid or random search
- Creating ML pipelines for production workflows
- Comparing different algorithms for a task
- Working with both structured (tabular) and text data
- Needing interpretable, classical machine learning approaches
Quick Start
Classification Example
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
Complete Pipeline with Mixed Data
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
# Define feature types
numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']
# Create preprocessing pipelines
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline
model = Pipeline([
('preprocessor', preprocessor),
('classifier', GradientBoostingClassifier(random_state=42))
])
# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Core Capabilities
1. Supervised Learning
Comprehensive algorithms for classification and regression tasks.
Key algorithms:
- Linear models: Logistic Regression, Linear Regression, Ridge, Lasso, ElasticNet
- Tree-based: Decision Trees, Random Forest, Gradient Boosting
- Support Vector Machines: SVC, SVR with various kernels
- Ensemble methods: AdaBoost, Voting, Stacking
- Neural Networks: MLPClassifier, MLPRegressor
- Others: Naive Bayes, K-Nearest Neighbors
When to use:
- Classification: Predicting discrete categories (spam detection, image classification, fraud detection)
- Regression: Predicting continuous values (price prediction, demand forecasting)
See: references/supervised_learning.md for detailed algorithm documentation, parameters, and usage examples.
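As a rough illustration of comparing algorithm families from the list above, the sketch below cross-validates a linear model and a tree-based model on synthetic data. The dataset, the candidate names, and the hyperparameters are assumptions chosen for illustration, not recommendations from the reference files.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# One linear candidate (scaled inside a pipeline) and one tree-based candidate
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Score every candidate with the same 5-fold cross-validation
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Scoring all candidates on the same folds and metric keeps the comparison fair. Which model wins depends on the data.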
2. Unsupervised Learning
Discover patterns in unlabeled data through clustering and dimensionality reduction.
Clustering algorithms:
- Partition-based: K-Means, MiniBatchKMeans
- Density-based: DBSCAN, HDBSCAN, OPTICS
- Hierarchical: AgglomerativeClustering
- Probabilistic: Gaussian Mixture Models
- Others: MeanShift, SpectralClustering, BIRCH
Dimensionality reduction:
- Linear: PCA, TruncatedSVD, NMF
- Manifold learning: t-SNE, Isomap, LLE, MDS, ClassicalMDS (1.8+)
- External (install separately): UMAP (umap-learn)
- Feature extraction: FastICA, LatentDirichletAllocation
When to use:
- Customer segmentation, anomaly detection, data visualization
- Reducing feature dimensions, exploratory data analysis
- Topic modeling, image compression
See: references/unsupervised_learning.md for detailed documentation.
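To make the difference between partition-based and density-based clustering concrete, here is a minimal sketch on the two-moons toy dataset. The dataset and the `eps` and `min_samples` values are assumptions for illustration. The ground-truth labels are used only to score the result, which is rarely possible on real data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-circles: non-convex clusters
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Partition-based: assumes roughly convex, similarly sized clusters
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Density-based: follows dense regions of any shape; label -1 marks noise
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"KMeans ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI: {ari_dbscan:.2f}")
```

On shapes like these, the density-based method usually recovers the two groups while K-Means cuts across them. On compact, blob-like data the ranking can reverse.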
3. Model Evaluation and Selection
Tools for robust model evaluation, cross-validation, and hyperparameter tuning.
Cross-validation strategies:
- KFold, StratifiedKFold (classification)
- TimeSeriesSplit (temporal data)
- GroupKFold (grouped samples)
Hyperparameter tuning:
- GridSearchCV (exhaustive search)
- RandomizedSearchCV (random sampling)
- HalvingGridSearchCV (successive halving)
Metrics:
- Classification: accuracy, precision, recall, F1-score, ROC AUC, confusion matrix
- Regression: MSE, RMSE, MAE, R², MAPE
- Clustering: silhouette score, Calinski-Harabasz, Davies-Bouldin
When to use:
- Comparing model performance objectively
- Finding optimal hyperparameters
- Preventing overfitting through cross-validation
- Understanding model behavior with learning curves
See: references/model_evaluation.md for comprehensive metrics and tuning strategies.
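The pieces above can be combined in a few lines. The sketch below evaluates one pipeline on several metrics with a stratified splitter, then tunes a single hyperparameter with random search. The synthetic dataset, the search range for `C`, and the choice of ROC AUC as the tuning metric are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    cross_validate,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data (about 80% / 20%), illustrative only
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Evaluate several metrics in a single cross-validation pass
metrics = ["accuracy", "f1", "roc_auc"]
cv_results = cross_validate(pipe, X, y, cv=cv, scoring=metrics)
for metric in metrics:
    print(f"{metric}: {cv_results[f'test_{metric}'].mean():.3f}")

# Random search over the regularization strength, sampled on a log scale
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="roc_auc",
    cv=cv,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Reusing the same `cv` object for evaluation and tuning keeps the folds comparable. For a final performance estimate, score the tuned model on a held-out test set that the search never saw.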
4. Data Preprocessing
Transform raw data into formats suitable for machine learning.
Scaling and normalization:
- StandardScaler (zero mean, unit variance)
- MinMaxScaler (bounded range)
- RobustScaler (robust to outliers)
- Normalizer (sample-wise normalization)
Encoding categorical variables:
- OneHotEncoder (nominal categories)
- OrdinalEncoder (ordered categories)
- LabelEncoder (encodes target labels y; not intended for input features)
Handling missing values:
- SimpleImputer (mean, median, most frequent)
- KNNImputer (k-nearest neighbors)
- IterativeImputer (multivariate imputation)
Feature engineering:
- PolynomialFeatures (polynomial and interaction terms)
- KBinsDiscretizer (binning)
- Feature selection (RFE, SelectKBest, SelectFromModel)
When to use:
- Before training any algorithm that requires scaled features (SVM, KNN, Neural Networks)
- Converting categorical variables to numeric format
- Handling missing data systematically
- Creating non-linear features for linear models
See: references/preprocessing.md for detailed preprocessing techniques.
5. Pipelines and Composition
Build reproducible, production-ready ML workflows.
Key components:
- Pipeline: Chain transformers and estimators sequentially
- ColumnTransformer: Apply different preprocessing to different columns
- FeatureUnion: Combine multiple transformers in parallel
- TransformedTargetRegressor: Transform target variable
Benefits:
- Prevents data leakage in cross-validation
- Simplifies code and improves maintainability
- Enables joint hyperparameter tuning
- Ensures consistency between training and prediction
When to use:
- Always use Pipelines for production workflows
- When mixing numerical and categorical features (use ColumnTransformer)
- When performing cross-validation with preprocessing steps
- When hyperparameter tuning includes preprocessing parameters
See: references/pipelines_and_composition.md for comprehensive pipeline patterns.
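Joint hyperparameter tuning relies on the `step__parameter` naming convention. The sketch below searches over a preprocessing parameter and a model parameter in the same grid. The step names, the grid values, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, illustrative only
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("svc", SVC()),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {
    "pca__n_components": [2, 4, 8],  # preprocessing parameter
    "svc__C": [0.1, 1.0, 10.0],      # model parameter
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Because the scaler and PCA are refit inside every training fold, the cross-validated score is not contaminated by the validation folds.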
Example Scripts
Classification Pipeline
Run a complete classification workflow with preprocessing, model comparison, hyperparameter tuning, and evaluation:
uv run python scripts/classification_pipeline.py
This script demonstrates:
- Handling mixed data types (numeric and categorical)
- Model comparison using cross-validation
- Hyperparameter tuning with GridSearchCV
- Comprehensive evaluation with multiple metrics
- Feature importance analysis
Clustering Analysis
Perform clustering analysis with algorithm comparison and visualization:
uv run python scripts/clustering_analysis.py
This script demonstrates:
- Finding optimal number of clusters (elbow method, silhouette analysis)
- Comparing multiple clustering algorithms (K-Means, DBSCAN, Agglomerative, Gaussian Mixture)
- Evaluating clustering quality without ground truth
- Visualizing results with PCA projection
Reference Documentation
This skill includes comprehensive reference files for deep dives into specific topics:
Quick Reference
File: references/quick_reference.md
- Common import patterns and installation instructions
- Quick workflow templates for common tasks
- Algorithm selection cheat sheets
- Common patterns and gotchas
- Performance optimization tips
Supervised Learning
File: references/supervised_learning.md
- Linear models (regression and classification)
- Support Vector Machines
- Decision Trees and ensemble methods
- K-Nearest Neighbors, Naive Bayes, Neural Networks
- Algorithm selection guide
Unsupervised Learning
File: references/unsupervised_learning.md
- All clustering algorithms with parameters and use cases
- Dimensionality reduction techniques
- Outlier and novelty detection
- Gaussian Mixture Models
- Method selection guide
Model Evaluation
File: references/model_evaluation.md
- Cross-validation strategies
- Hyperparameter tuning methods
- Classification, regression, and clustering metrics
- Learning and validation curves
- Best practices for model selection
Preprocessing
File: references/preprocessing.md
- Feature scaling and normalization
- Encoding categorical variables
- Missing value imputation
- Feature engineering techniques
- Custom transformers
Pipelines and Composition
File: references/pipelines_and_composition.md
- Pipeline construction and usage
- ColumnTransformer for mixed data types
- FeatureUnion for parallel transformations
- Complete end-to-end examples
- Best practices
Common Workflows
Building a Classification Model
1. Load and explore data
import pandas as pd
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
2. Split data with stratification
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
3. Create preprocessing pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Handle numeric and categorical features separately
# (numeric_features and categorical_features are lists of column names, as in the Quick Start)
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
4. Build complete pipeline
from sklearn.ensemble import RandomForestClassifier
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
5. Tune hyperparameters
from sklearn.model_selection import GridSearchCV
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
6. Evaluate on test set
from sklearn.metrics import classification_report
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
Performing Clustering Analysis
1. Preprocess data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
2. Find optimal number of clusters
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
k_values = range(2, 11)
scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))
# Pick the k with the highest silhouette score
optimal_k = k_values[int(np.argmax(scores))]
3. Apply clustering
model = KMeans(n_clusters=optimal_k, random_state=42)
labels = model.fit_predict(X_scaled)
4. Visualize with dimensionality reduction
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis')
Best Practices
Always Use Pipelines
Pipelines prevent data leakage and ensure consistency:
# Good: Preprocessing in pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
# Bad: Preprocessing outside (can leak information)
X_scaled = StandardScaler().fit_transform(X)
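Leakage from a scaler is usually subtle, so the sketch below swaps in a step where the effect is easy to see: supervised feature selection on pure-noise data. This is an assumption made for illustration, not the example from the text above. The features carry no signal, so an honest accuracy estimate should sit near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Leaky: the selector sees every label, including those of future test folds
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: selection is refit inside each training fold
pipeline = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"Selection outside CV: {leaky_score:.2f}")
print(f"Selection inside pipeline: {honest_score:.2f}")
```

The gap between the two numbers comes entirely from information leaking out of the validation folds. The same mechanism applies, more quietly, to scalers and imputers fit on the full dataset.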
Fit on Training Data Only
Never fit on test data:
# Good
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Only transform
# Bad
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
Use Stratified Splitting for Classification
Preserve class distribution:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
Set Random State for Reproducibility
model = RandomForestClassifier(n_estimators=100, random_state=42)
Choose Appropriate Metrics
- Balanced data: Accuracy, F1-score
- Imbalanced data: Precision, Recall, ROC AUC, Balanced Accuracy
- Cost-sensitive: Define custom scorer
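A minimal sketch of why accuracy misleads on imbalanced data, and of one way to define a cost-sensitive scorer with `make_scorer`. The cost values and the helper name `misclassification_cost` are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    make_scorer,
)

# 95 negatives, 5 positives, and a "model" that always predicts negative
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(acc)      # 0.95
print(bal_acc)  # 0.5


def misclassification_cost(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Hypothetical cost function: a missed positive costs 10x a false alarm."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn * fn_cost + fp * fp_cost


# greater_is_better=False makes the scorer return the negated cost,
# so model selection tools can still maximize it
cost_scorer = make_scorer(misclassification_cost, greater_is_better=False)

X_dummy = np.zeros((100, 1))
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_score = cost_scorer(baseline, X_dummy, y_true)
print(baseline_score)  # -50.0
```

The resulting `cost_scorer` can be passed as `scoring=` to `cross_val_score` or `GridSearchCV`.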
Scale Features When Required
Algorithms requiring feature scaling:
- SVM, KNN, Neural Networks
- PCA, Linear/Logistic Regression with regularization
- K-Means clustering
Algorithms not requiring scaling:
- Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
- Naive Bayes
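To show how much scaling can matter for a distance-based model, the sketch below runs K-Nearest Neighbors on the bundled wine dataset with and without standardization. The dataset choice is an assumption for illustration. Its features span very different numeric ranges, which makes the effect visible.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Without scaling, features with large numeric ranges dominate the distances
raw_score = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()

# With scaling inside a pipeline, every feature contributes comparably
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_score = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"KNN without scaling: {raw_score:.3f}")
print(f"KNN with scaling: {scaled_score:.3f}")
```

Tree-based models split on one feature at a time, so they are largely insensitive to this kind of rescaling.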
Troubleshooting Common Issues
ConvergenceWarning
Issue: Model didn't converge
Solution: Increase max_iter or scale features
model = LogisticRegression(max_iter=1000)
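Scaling is often the more effective of the two fixes. The sketch below checks whether an unscaled fit emits a ConvergenceWarning and then fits the same model behind a StandardScaler. The breast cancer dataset is an assumption chosen because its features have very different ranges.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Fit on raw features and record any warnings instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    LogisticRegression(max_iter=100).fit(X, y)
unscaled_warned = any(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print("Unscaled fit raised ConvergenceWarning:", unscaled_warned)

# Same model behind a scaler; n_iter_ reports how many iterations were used
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X, y)
n_iter = int(scaled_model[-1].n_iter_[0])
print("Iterations used after scaling:", n_iter)
```

If the warning persists after scaling, raising `max_iter` further or trying a different `solver` are the usual next steps.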
Poor Performance on Test Set
Issue: Overfitting
Solution: Use regularization, cross-validation, or a simpler model
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Add regularization
model = Ridge(alpha=1.0)
# Use cross-validation
scores = cross_val_score(model, X, y, cv=5)
Memory Error with Large Datasets
Solution: Use algorithms designed for large data
# Use SGD for large datasets
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
# Or MiniBatchKMeans for clustering
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=8, batch_size=100)
Additional Resources
- Official Documentation: https://scikit-learn.org/stable/
- User Guide: https://scikit-learn.org/stable/user_guide.html
- API Reference: https://scikit-learn.org/stable/api/index.html
- Examples Gallery: https://scikit-learn.org/stable/auto_examples/index.html
How to use scikit-learn on Cursor
AI-first code editor with Composer
Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- Cursor installed and configured on your development machine
- Node.js version 16.0+ with npm package manager (verify with node --version)
- Active project directory or workspace where you want to add scikit-learn
Execute installation command
Execute the skills CLI command in your project's root directory to begin installation:
The skills CLI fetches scikit-learn from GitHub repository K-Dense-AI/scientific-agent-skills and configures it for Cursor.
Select Cursor when prompted
The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:
Verify installation
Confirm successful installation by checking the skill directory location:
Reload or restart Cursor to activate scikit-learn. Access the skill through slash commands (e.g., /scikit-learn) or your agent's skill management interface.
Security & Verification Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.