AutoFlowLab

AI Form Data Extraction: Turn Messy Submissions Into Clean Leads

Use AI with a JSON-schema prompt to pull name, company, budget, and intent from messy form submissions, validate the output, and route clean leads to your CRM.

May 26, 2026 · intermediate · 1 hour setup

The “Tell us about your project” textarea is where lead data goes to die. Somewhere in that paragraph is a company name, a budget, a timeline, and a buying signal — but your CRM gets a blob, your sales team gets homework, and your reporting gets nothing. Inbound email leads are worse: no fields at all.

This tutorial builds an extraction pipeline in Make, with an n8n variant at the end: free-text submissions go to an LLM with a strict JSON-schema prompt, the output gets validated (never trust an LLM with your CRM write access unsupervised), clean rows route to the CRM, and anything the model wasn’t confident about lands in a human-review sheet instead of silently polluting your pipeline. That confidence-gated routing is what makes this production-grade rather than a demo, and it’s why this is an intermediate build — budget about an hour.

What you’re building

  1. Trigger — a webhook catches form submissions (Typeform, Tally, a plain HTML form) or new inbound emails.
  2. AI extraction — one OpenAI call with a JSON-schema prompt returns structured fields plus a per-field confidence score.
  3. Validation — a parse step and explicit checks: is the email syntactically valid, is budget actually a number, did the model invent a company name?
  4. Routing — a router sends high-confidence rows to HubSpot (or any CRM) and Google Sheets; low-confidence rows go to a review sheet and a Slack ping.

Free template · make

Form Lead Extractor

form-lead-extractor-make.json


Prerequisites

  • Make account (each lead costs 6–9 operations; as of mid-2026, check current pricing for your volume)
  • OpenAI API key
  • HubSpot (free CRM tier works), Google Sheets, Slack
  • A form tool that can POST to a webhook, or a Gmail/IMAP mailbox receiving leads

Step 1: Catch the submission

  1. New scenario → Webhooks > Custom webhook, name it form-leads, copy the URL.
  2. Point your form at it. Typeform: Connect > Webhooks. Tally: Integrations > Webhooks. Plain HTML form: POST the fields as JSON.
  3. Submit one real test entry, then click Redetermine data structure on the webhook so Make learns your payload shape.

For email leads instead: use Gmail > Watch Emails filtered to your leads alias (q: to:leads@yourdomain.com is:unread), and treat subject + body as the free text. Everything downstream is identical — that’s the point of normalizing to one raw_text variable. Add a Tools > Set Variable module named raw_text that concatenates whatever fields exist:

Name field: {{name}}
Email field: {{email}}
Message: {{message}}

Include the structured fields you do have: the model uses them as anchors, and seeing them alongside the message lets it handle the case where the “name” field contains “asdf” and the real name is in the message.
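If you would rather build raw_text in code (for example in an n8n Code node), the concatenation above can be sketched as a small function. The payload keys (name, email, message, subject, body) are assumptions about your form or mailbox, not something any form tool guarantees; adapt them to your payload shape.

```javascript
// Sketch: build one raw_text string from whichever fields a submission has.
// The keys below are assumed field names; rename them to match your payload.
function buildRawText(payload) {
  const labels = [
    ["name", "Name field"],
    ["email", "Email field"],
    ["message", "Message"],
    ["subject", "Subject"], // email leads: subject + body act as the free text
    ["body", "Body"],
  ];
  return labels
    // Skip fields that are missing or blank so the prompt stays compact.
    .filter(([key]) => typeof payload[key] === "string" && payload[key].trim() !== "")
    .map(([key, label]) => `${label}: ${payload[key].trim()}`)
    .join("\n");
}

const formText = buildRawText({
  name: "asdf",
  email: "dana@example.com",
  message: "Hi, Dana Reyes from Northwind here.",
});
const emailText = buildRawText({ subject: "Quote request", body: "We need a new site." });
console.log(formText);
console.log(emailText);
```

Form and email leads come out in the same labelled format, so nothing downstream needs to know which trigger fired.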

Step 2: The JSON-schema extraction prompt

  1. Add OpenAI > Create a Completion. Model: gpt-4o-mini for cost, gpt-4o if your messages are long or multilingual. Response format: JSON object. Temperature: 0. Max tokens: 800.
  2. Messages:
SYSTEM:
You are a lead-data extraction engine for a B2B services company. You return ONLY valid JSON matching the schema below. You never invent data: if a field is not stated or strongly implied in the input, you return null for it and lower your confidence. You never return markdown fences.

USER:
Extract lead data from this form submission / email.

SCHEMA (every key required, use null when unknown):
{
  "first_name": "string or null",
  "last_name": "string or null",
  "email": "string or null — must appear verbatim in the input, never construct one",
  "company": "string or null — the lead's company, not products they mention",
  "budget_usd": "integer or null — annual/project budget in USD. Convert '15k' to 15000, '$2,500/mo' to 30000 (annualize monthly). null if no figure given.",
  "timeline": "one of: immediate | this_quarter | this_year | exploring | null",
  "intent": "one of: buy | question | partnership | job_application | spam",
  "summary": "one sentence: who they are and what they want",
  "confidence": {
    "email": 0.0-1.0,
    "company": 0.0-1.0,
    "budget_usd": 0.0-1.0,
    "intent": 0.0-1.0
  }
}

CONFIDENCE RULES:
- 0.9+ only when the value is stated explicitly.
- 0.5-0.8 when inferred (e.g. company guessed from email domain).
- Below 0.5 when you are guessing. Guess less, null more.

INPUT:
{{raw_text}}

Three design decisions worth stealing for any extraction prompt. Enums over free text: timeline and intent are closed lists, so downstream filters are exact string matches instead of fuzzy logic. Normalization in the schema: “convert 15k to 15000” turns the messiest field (budget) into something you can actually report on. Confidence as data: asking the model to grade itself isn’t perfectly calibrated, but it’s reliably directional. The 0.4s are genuinely worse than the 0.9s, and that’s all a routing gate needs.
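For readers calling the API directly instead of through the Make module, the settings from step 2 map onto a Chat Completions request body roughly as follows. This is a sketch, not the template's implementation: SYSTEM_PROMPT and USER_TEMPLATE are abbreviated stand-ins for the full prompt text above, and buildExtractionRequest is a hypothetical helper name.

```javascript
// Sketch of a Chat Completions request body with the step 2 settings.
// SYSTEM_PROMPT and USER_TEMPLATE are shortened placeholders; paste the full prompt in practice.
const SYSTEM_PROMPT =
  "You are a lead-data extraction engine for a B2B services company. You return ONLY valid JSON matching the schema below.";
const USER_TEMPLATE =
  "Extract lead data from this form submission / email.\n\nINPUT:\n{{raw_text}}";

function buildExtractionRequest(rawText, model = "gpt-4o-mini") {
  return {
    model,
    temperature: 0, // keep extraction as repeatable as possible
    max_tokens: 800,
    response_format: { type: "json_object" }, // JSON mode; the prompt must mention JSON
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      // A function replacement keeps "$" sequences in the lead text (e.g. "$2,500") literal.
      { role: "user", content: USER_TEMPLATE.replace("{{raw_text}}", () => rawText) },
    ],
  };
}
```

The body would be POSTed as JSON with your API key in the Authorization header; the HTTP call is left out here so the sketch stays runnable without credentials.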

Step 3: Parse and validate

  1. JSON > Parse JSON, mapped from the completion result. Generate the data structure by pasting the schema sample.
  2. Now the checks the LLM can’t be trusted to do for itself. Add Tools > Set Multiple Variables:
    • email_valid: {{if(contains(email; "@") = false; false; if(contains(raw_text; email); true; false))}} — the second clause is the anti-hallucination check: the email must literally appear in the input. Models will construct jsmith@acme.com from “John Smith at Acme” if you let them.
    • budget_is_number: {{if(budget_usd = null; true; if(budget_usd >= 0; true; false))}} — null is fine, garbage isn’t.
    • min_confidence: {{min(confidence.email; confidence.intent)}}.
  3. These feed the router in the next step. Resist the urge to do validation inside the prompt — the model that made the mistake won’t catch the mistake.
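The three checks above can also be sketched as plain JavaScript, which is handy for unit-testing the logic before wiring it into a scenario. The function name validateExtraction is hypothetical. Two choices here are assumptions rather than part of the Make formulas: the verbatim email match ignores letter case, and a missing confidence score counts as 0 so the row falls through to review.

```javascript
// Sketch of the step 3 checks: email must appear in the source, budget must be
// null or a non-negative integer, and the gate uses the lower of two scores.
function validateExtraction(parsed, rawText) {
  const email = parsed.email;
  const emailValid =
    typeof email === "string" &&
    email.includes("@") &&
    // Anti-hallucination check: the address must literally occur in the input.
    rawText.toLowerCase().includes(email.toLowerCase());

  const budget = parsed.budget_usd;
  const budgetIsNumber = budget === null || (Number.isInteger(budget) && budget >= 0);

  const conf = parsed.confidence || {};
  const scores = [conf.email, conf.intent];
  // Assumption: any missing score forces 0, which sends the lead to review.
  const minConfidence = scores.every((s) => typeof s === "number") ? Math.min(...scores) : 0;

  return { emailValid, budgetIsNumber, minConfidence };
}

// A constructed address that never appeared in the message fails the check.
const invented = validateExtraction(
  { email: "jsmith@acme.com", budget_usd: 15000, confidence: { email: 0.6, intent: 0.9 } },
  "John Smith at Acme, budget 15k"
);
console.log(invented.emailValid); // false
```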

Step 4: Route by confidence

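  1. Add Flow Control > Router with three routes, top to bottom. Make processes routes in order, and a bundle goes down every route whose filter it passes, not just the first match, so put the trash filter first and keep the filters mutually exclusive: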

Route 1 — Spam/irrelevant. Filter: intent equals spam OR intent equals job_application. Action: Google Sheets > Add a Row to a Discarded tab and stop. Logging discards beats deleting them — you’ll want to audit the classifier in week one.

Route 2 — Clean lead. Filter: email_valid = true AND min_confidence ≥ 0.7 AND intent not equal to spam AND intent not equal to job_application. Actions:

  1. HubSpot CRM > Create/Update a Contact — map first/last name, email, company. Map budget_usd and timeline to custom properties (create them once in HubSpot: Settings > Properties, number and dropdown types respectively).
  2. Google Sheets > Add a Row to your Leads master sheet with every field plus the confidence scores.
  3. Optionally Slack > Create a Message to #sales for budget_usd ≥ 10000 — a filter on the connection line handles that.

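Route 3 — Human review (the fallback). Mark this route as the fallback route in its filter settings so it catches only what the first two didn’t; a route with no filter and no fallback flag receives every bundle. Actions: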

  1. Google Sheets > Add a Row to a Needs Review tab: the extracted fields, the confidence scores, and the original raw_text — reviewers need the source, not just the model’s guess.
  2. Slack > Create a Message to #leads-review: Low-confidence lead needs eyes: {{summary}} (email conf: {{confidence.email}}) → [review sheet link].

The review queue is the feature, not the failure mode. A model that’s wrong 5% of the time with no flag is a liability; a model that’s wrong 5% of the time and tells you which 5% is a colleague. Expect roughly 10–20% of real-world submissions to land in review at the 0.7 threshold; tune from there based on what your reviewers actually find.
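The three-way gate is small enough to sketch as one function, which makes the threshold easy to test against a sample of past leads before you change it in the router. routeLead and the field names emailValid and minConfidence are hypothetical names for the step 3 outputs; the 0.7 threshold is the one used above.

```javascript
// Sketch of the router logic: discard first, then the clean-lead gate, then fallback.
const CONFIDENCE_THRESHOLD = 0.7;

function routeLead(lead) {
  // Route 1: spam and job applications are logged to the Discarded tab.
  if (lead.intent === "spam" || lead.intent === "job_application") return "discarded";
  // Route 2: only verified emails with high enough confidence reach the CRM.
  if (lead.emailValid === true && lead.minConfidence >= CONFIDENCE_THRESHOLD) return "clean";
  // Route 3: everything else goes to human review.
  return "review";
}

console.log(routeLead({ intent: "buy", emailValid: true, minConfidence: 0.9 })); // clean
console.log(routeLead({ intent: "buy", emailValid: true, minConfidence: 0.55 })); // review
console.log(routeLead({ intent: "spam", emailValid: true, minConfidence: 0.95 })); // discarded
```

Because each lead returns exactly one label, this version cannot send a row down two paths, which is the property the Make filters need to reproduce.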

Try it yourself

Make

Three-way confidence routing is two clicks on a Make router — the visual canvas makes the review path impossible to forget.


The n8n variant

The same pipeline in n8n, in five nodes — worth it if you’re self-hosting for data privacy (lead data never transits a third-party automation cloud) or already run n8n:

  1. Webhook node (POST, path form-leads).
  2. OpenAI node (or AI Agent with a structured output parser) — same prompt. n8n’s Structured Output Parser sub-node accepts a real JSON Schema and auto-retries on invalid output, which is genuinely better than prompt-only enforcement.
  3. Code node for validation. This is where n8n pulls ahead, because the email-in-source check is one honest line: return items.map(i => i.json.email && $('Webhook').first().json.body.message.includes(i.json.email) ? i : { json: { ...i.json, email: null } }); (adapt to your payload shape). Use map rather than filter here: the goal is to null out an unverified email and keep the item, not to drop it.
  4. Switch node on min_confidence with the same three branches.
  5. HubSpot / Google Sheets / Slack nodes per branch.

Self-hosted n8n’s costs don’t scale with lead volume — relevant if you process thousands of submissions monthly (as of mid-2026, check current cloud pricing for the hosted version).

Which platform should you use?

  • Make — best default: the router-with-filters pattern maps one-to-one onto the confidence-gating design, and non-engineers on your team can read the canvas and adjust thresholds.
  • n8n — pick it for self-hosting/privacy, code-level validation, or high volume where per-operation pricing stings. The structured output parser with auto-retry is the best JSON-enforcement story of the three tools.
  • Zapier — workable for a two-path version (clean vs. review) using Paths, but three-way routing plus per-field validation gets awkward in a linear editor, and Paths are a higher-tier feature (as of mid-2026, check current pricing). Full comparison: Make vs Zapier vs n8n.

Common errors and fixes

Parse JSON fails on maybe 1 in 50 leads. Almost always max-token truncation: a long rambling submission produced a long summary and the object got cut mid-string. Raise max tokens above the 800 set in step 2 and shorten the summary instruction. Add an error handler with a Resume directive routing to the review sheet, so even a parse failure becomes a review row rather than a lost lead.
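The same "parse failure becomes a review row" idea, sketched in JavaScript for the n8n variant or any custom code path. parseOrReview is a hypothetical helper; it assumes you still have the original raw_text on hand to attach to the review row.

```javascript
// Sketch: never lose a lead to a parse error; fall back to the review route instead.
function parseOrReview(completionText, rawText) {
  try {
    const parsed = JSON.parse(completionText);
    // The schema expects a JSON object, so reject arrays, strings, numbers, and null.
    if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
      throw new Error("completion is not a JSON object");
    }
    return { ok: true, lead: parsed };
  } catch (err) {
    // Keep the source text so a reviewer can extract the fields by hand.
    return { ok: false, route: "review", reason: err.message, rawText };
  }
}

// A completion cut off mid-string by the token limit ends up in review.
const truncated = parseOrReview('{"first_name": "Dana", "summary": "Dana wants a', "Hi, Dana here");
console.log(truncated.ok, truncated.route); // false review
```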

Hallucinated emails reaching HubSpot. Your verbatim check (step 3) isn’t running, or it’s checking against the wrong variable — it must compare against raw_text, not the parsed output. This is the single most damaging failure mode in the build: a fabricated email creates a contact that bounces forever.

HubSpot returns “Property does not exist” (400). You mapped budget_usd before creating the custom property, or created it in a different object type (Contact vs. Deal). Create properties first, then reconnect the module so the field picker refreshes.

Gmail trigger re-processes old emails after re-auth. Google OAuth connections that use Gmail’s restricted scopes can expire, and on reconnect the watermark can reset. Use the is:unread filter plus a Gmail > Mark as Read module at the end of every route: idempotency through state in the mailbox, not in Make.

Webhook returns 200 but the scenario didn’t run. Make queues webhook payloads when the scenario is off or you’ve hit plan limits; check the webhook’s queue in Webhooks settings. Also confirm your form sends Content-Type: application/json — Typeform does, raw HTML forms often send application/x-www-form-urlencoded, which Make parses differently than your saved data structure expects.
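To see why the two content types produce different shapes, here is a minimal sketch of normalizing both into the same flat object before building raw_text. parseWebhookBody is a hypothetical helper for a self-hosted receiver or Code node; Make does its own parsing, so this illustrates the difference rather than describing Make's internals.

```javascript
// Sketch: accept JSON and URL-encoded form posts and return one plain object.
function parseWebhookBody(contentType, body) {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const type = (contentType || "").split(";")[0].trim().toLowerCase();
  if (type === "application/json") {
    return JSON.parse(body);
  }
  if (type === "application/x-www-form-urlencoded") {
    // URLSearchParams decodes percent-escapes and treats "+" as a space.
    return Object.fromEntries(new URLSearchParams(body));
  }
  throw new Error(`Unsupported Content-Type: ${type || "(none)"}`);
}

const fromHtmlForm = parseWebhookBody(
  "application/x-www-form-urlencoded",
  "name=Dana+Reyes&email=dana%40example.com"
);
console.log(fromHtmlForm.name, fromHtmlForm.email); // Dana Reyes dana@example.com
```

Note that URL-encoded values always arrive as strings, so a numeric form field still needs converting before any number check.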

429 rate limits during a campaign spike. A landing-page launch can deliver 50 submissions a minute. Set the OpenAI module’s error handler to Break with 5 retries / 30-second intervals, and turn on Sequential processing in scenario settings so bundles queue instead of racing.
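Outside Make, the equivalent of that Break handler is a fixed-interval retry around the API call. A minimal sketch, assuming the thrown error carries an HTTP status on err.status (an assumption about your HTTP client, so check what yours actually exposes); withRetry is a hypothetical helper name.

```javascript
// Sketch: retry only on HTTP 429, up to `retries` extra attempts, waiting a fixed delay.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(call, { retries = 5, delayMs = 30000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Assumption: rate-limit errors expose the HTTP status as err.status.
      const rateLimited = Boolean(err) && err.status === 429;
      // Anything else, or running out of retries, is surfaced to the caller.
      if (!rateLimited || attempt >= retries) throw err;
      await sleep(delayMs);
    }
  }
}
```

Awaiting each lead in turn, rather than firing all requests at once, gives the same effect as sequential processing: submissions queue behind the one currently retrying instead of racing it.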

Everything lands in the review route. Check min_confidence is mapping real numbers — if the path is wrong, Make’s min() of empty values silently fails the ≥ 0.7 filter. Run once and inspect the Set Variables output bubble.

Where to take it next

Once extracted leads hit the CRM, the natural next step is scoring and follow-up — the lead capture and CRM tutorial picks up exactly where this one ends. The same schema-plus-confidence pattern also powers our invoice processing build (swap lead fields for invoice fields) and the AI email triage pipeline. And if these leads kick off projects, wire route 2 straight into client onboarding.

Schema, validation, confidence gate. Learn the pattern once and every messy-text-to-structured-data problem in your business becomes the same one-hour build.