Datasets

Two complementary dataset families built by 60,000+ STEM educators: frontier-level text Q&A pairs for reasoning and instruction tuning, and the largest multimodal STEM dataset for training models that can see, read, and explain.

5M+
Q&A Pairs & Video Solutions
50+
STEM Disciplines
60K+
Contributing Educators
100%
Human-created Content
Dataset 01

Frontier STEM Q&A

Millions of expert-written, text-based question-answer pairs with full worked solutions and structured reasoning chains. Built for fine-tuning, RLHF, and evaluation across the full spectrum of STEM difficulty.

Expert-verified Q&A Pairs

Millions of question-answer pairs spanning introductory through graduate-level STEM, each written and verified by credentialed subject-matter experts. Answers include full worked solutions, not just final results.

  • Covers 50+ disciplines: math, physics, chemistry, biology, engineering, economics, and more
  • Difficulty-stratified from introductory coursework to PhD-qualifying problems
  • Written by professors, graduate researchers, and experienced educators

Multi-step Reasoning Chains

Solutions are structured as explicit chains of reasoning, with each step isolated and labeled. This makes the data directly usable for training and evaluating chain-of-thought capabilities in language models.

  • Step boundaries annotated for segmented fine-tuning
  • Intermediate reasoning exposed, not just input/output pairs
  • Conceptual justification included alongside mathematical derivation

Rich Metadata & Taxonomy

Every Q&A pair is tagged with structured metadata that supports curriculum-aware training, difficulty-based sampling, and prerequisite modeling.

  • Subject, topic, and subtopic classification
  • Difficulty rating (introductory, intermediate, advanced, graduate)
  • Prerequisite concept mappings for scaffolded training
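As a hedged sketch of how such metadata could drive difficulty-based sampling (field names and values here are illustrative assumptions, not the dataset's published schema):

```python
# Illustrative only: field names and values are assumptions,
# not the dataset's actual schema.
record = {
    "question": "A 2 kg block slides down a frictionless 30-degree incline...",
    "subject": "physics",
    "topic": "classical_mechanics",
    "subtopic": "inclined_planes",
    "difficulty": "introductory",  # introductory | intermediate | advanced | graduate
    "prerequisites": ["newtons_laws", "vector_decomposition"],
}

def filter_by_difficulty(records, levels):
    """Select records whose difficulty tag is in the requested set."""
    return [r for r in records if r["difficulty"] in levels]

# Sampling only graduate-level material excludes this introductory record.
hard_subset = filter_by_difficulty([record], {"advanced", "graduate"})
```

The same tags support prerequisite-aware curricula: filter on `prerequisites` instead of `difficulty` to build scaffolded training orders.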

Training-ready Formats

Data is delivered in formats designed for modern AI pipelines. Whether you're running supervised fine-tuning, training reward models for RLHF, or building evaluation benchmarks, the data is structured to plug in directly.

  • Instruction-tuning format with question/answer/reasoning fields
  • Solution step boundaries for chain-of-thought training
  • Compatible with standard fine-tuning and evaluation frameworks
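As one hedged illustration of what an instruction-tuning record with segmented reasoning might look like (the field names below are assumptions, not the dataset's documented schema; JSONL is one common delivery convention):

```python
import json

# Hypothetical record layout: field names are illustrative assumptions.
example = {
    "question": "Evaluate the integral of 2x from 0 to 3.",
    "reasoning_steps": [
        "The antiderivative of 2x is x**2.",
        "Evaluate x**2 at the bounds: 3**2 - 0**2 = 9.",
    ],
    "answer": "9",
}

# Serialize one record per line (JSONL), a format most fine-tuning
# frameworks accept directly.
jsonl_line = json.dumps(example)
restored = json.loads(jsonl_line)
assert restored["answer"] == "9"
assert len(restored["reasoning_steps"]) == 2
```

Keeping each reasoning step as a separate list element is what makes step-boundary-aware (chain-of-thought) fine-tuning possible without re-parsing free text.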

Dataset 02

Multimodal STEM Reasoning

5M+ educator-created video solutions with aligned transcriptions, visual aid frames, and cross-modal pairings. The richest source of paired text-visual STEM data available for training multimodal models.

Step-by-Step Video Solutions

Full video walkthroughs of STEM problems recorded by verified educators. Each video captures the expert's reasoning process, including how they set up the problem, which visual aids they draw, and how they narrate each step.

  • 5M+ videos across math, physics, chemistry, biology, and engineering
  • Average 2.5 minutes per video with continuous visual annotation
  • Sourced from 60,000+ subject-matter experts

Aligned Transcriptions

Time-stamped transcriptions of every video, linking spoken explanation to the exact visual state of the screen. This pairing is what makes the data uniquely valuable for multimodal training.

  • Word-level timestamps aligned to video frames
  • Speaker narration paired with on-screen visual changes
  • Structured for text-to-visual and visual-to-text tasks
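A hedged sketch of what word-level alignment could look like in practice (the record structure, field names, and frame rate are assumptions for illustration):

```python
# Illustrative alignment record: each spoken word carries a start time,
# which can be mapped to the video frame on screen as it is spoken.
transcript = [
    {"word": "First,", "start_s": 0.0},
    {"word": "draw", "start_s": 0.5},
    {"word": "the", "start_s": 0.8},
    {"word": "free-body", "start_s": 1.0},
    {"word": "diagram.", "start_s": 1.5},
]

def frame_index(start_s, fps=30):
    """Map a word's start time to the nearest video frame number."""
    return round(start_s * fps)

# Pair each spoken word with its on-screen frame: the raw material
# for text-to-visual and visual-to-text training pairs.
pairs = [(w["word"], frame_index(w["start_s"])) for w in transcript]
```

Given extracted keyframes indexed by frame number, these pairs directly yield (narration, visual state) training examples.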

Visual Aid Frames

Extracted keyframes capturing the diagrams, graphs, equations, and illustrations that educators create while solving problems. These are not stock images. They are pedagogically motivated visual aids drawn in real time.

  • Diagrams, free-body diagrams, circuit schematics, molecular structures
  • Graphs, plots, and data tables with expert annotations
  • Step-by-step construction sequences showing how visual aids are built

Cross-modal Pairings

Tightly coupled text, audio, and visual data for every solution. Each modality reinforces the others, giving models the training signal they need to reason across representations.

  • Spoken explanation ↔ on-screen diagram at every timestamp
  • Written solution steps mapped to corresponding visual states
  • Enables joint training on text, vision, and pedagogical structure

Subject coverage

Both datasets span the full breadth of undergraduate and graduate STEM curricula, with depth in the disciplines that matter most for frontier reasoning.

Calculus, Linear Algebra, Differential Equations, Probability & Statistics, Real Analysis, Abstract Algebra, Physics I & II, Quantum Mechanics, Electromagnetism, Thermodynamics, Classical Mechanics, General Relativity, Organic Chemistry, Physical Chemistry, Biochemistry, General Biology, Molecular Biology, Genetics, Statics & Dynamics, Fluid Mechanics, Circuit Analysis, Signal Processing, Control Systems, Materials Science, Computer Science, Data Structures, Macroeconomics, Microeconomics, Econometrics, Convex Optimization, and many more.

How they're used

Our datasets are designed to slot into modern AI training pipelines, from pre-training through evaluation.

Supervised fine-tuning

Use expert-verified Q&A pairs with structured reasoning chains to fine-tune foundation models for STEM problem solving and explanation generation.

RLHF & reward modeling

Leverage expert solutions as a reward signal to align models toward accurate, well-structured STEM reasoning rather than surface-level pattern matching.

Multimodal pre-training

Train vision-language models on paired visual aids and spoken explanations to develop cross-modal reasoning grounded in real pedagogical content.

Benchmark construction

Build rigorous evaluation suites from expert-annotated data, covering answer correctness, explanation quality, visual reasoning, and pedagogical structure.

Chain-of-thought training

Step-segmented solutions provide explicit intermediate reasoning, ideal for teaching models to show their work and reason through multi-step problems.

Synthetic data generation

Use the dataset as a high-quality seed corpus to generate synthetic training examples that preserve the reasoning depth and domain accuracy of the originals.

Quality at every layer

Every piece of data traces back to a verified human educator. There is no synthetic generation in the source corpus: only real experts solving real problems, with real visual aids where applicable. This is what makes both datasets uniquely suited for training models that need to reason, not just pattern-match.