Building data in the QSR & restaurant space
Most off-the-shelf restaurant lists sell you locations and call them accounts. They miss 60% of the buyers and point your outbound at the wrong door. Here is how a practitioner builds the real dataset — and why most of it is free.
There are roughly 200,000 limited-service restaurant locations in the United States, and over 60% of them are franchised. That single fact — locations and operators are different units of analysis — is the reason almost every restaurant dataset you can buy is structurally wrong. It treats a storefront as an account. The storefront is not the account. The operator is.
The cheap, repeatable way to build a usable QSR and restaurant dataset has two halves. First, anchor the universe in Secretary of State filings and franchise records, not in storefront listings. The buyer lives in entity records, not on Yelp. Second, classify every web source before you pay for it. Most of what you need — phone numbers, owner names, team pages, structured markup — is sitting in raw HTML for free. Only the genuinely hard slice should ever touch a paid API.
Done right, you can build a near-complete national QSR target list, resolved to the operator level with contact data, for a small fraction of the $25,000 to $150,000 vendors charge for the same thing.
§ 01 — The Thesis: The mismatch nobody fixes
A single multi-unit operator might run 50 Subway stores through 15 different LLCs. The location count and the buyer count are wildly different numbers, and a vendor list that ignores the difference will route 50 emails to the wrong addresses.
Three structural problems define this space:
- Locations and operators are different units. A storefront-anchored dataset will conflate them. An entity-anchored one will not.
- Brand names mask legal entity names. The "McDonald's" on the sign is a brand. The operating entity is something like "Crawford Holdings Restaurant Group #4 LLC." The 13,000+ McDonald's locations in the US are operated by many separate legal entities, not by one account.
- The buyer is the operator, not corporate. Corporate sets brand standards. The franchisee decides what POS, payroll, equipment, supplies, and services to buy. Outbound aimed at brand HQ misses the buyer entirely.
The storefront is not the account. The operator is. A dataset that doesn't solve for that is, structurally, the wrong dataset — no matter how clean it looks.
§ 02 — The Terrain: Why this is harder than SMB data
Standard SMB targeting assumes one company, one website, one buyer, one address. Restaurants violate every part of that assumption. The same operator shows up as fifteen LLCs. The same brand shows up as thousands of independent legal entities. The same address might serve as the registered office for a holdings company that runs locations in three states. And the buyer isn't on the website — the website is a marketing surface for the storefront, not the operating entity.
This is why entity resolution — the unglamorous work of figuring out which LLCs belong to which human — is the most valuable layer in the stack, and why most vendors skip it.
§ 03 — The Sources: The five sources that actually matter
Five primary sources, in order of structural importance:
| Source | What it gives you | Why you need it |
|---|---|---|
| Secretary of State filings | Franchisee LLCs, registered addresses, officers, incorporation dates | The legal-entity backbone. Without this you have storefronts, not accounts. |
| Franchise Disclosure Documents | Franchisor's official list of franchisees by brand | Verifies the brand-to-franchisee link. Annual filings, up to a 12-month lag. |
| Alternative names / DBA records | Trade names like "Subway #38291" tied to a legal entity | The bridge from brand storefront to operating LLC. Coverage varies by state. |
| Credit-card transaction panels | Estimated per-location revenue and transaction volume | Sizing operators so you can prioritize. Estimates, not exact figures. |
| Health department permits | Active operating permit per location | Confirms the location is actually open, not dark with a still-active LLC. |
Storefront sources — Yelp, Google Business, brand store locators — are useful for verification and address normalization, but should never be the spine of the dataset. They have no operator data and no entity resolution. They cannot tell you that twelve of those locations belong to the same buyer.
POS technographic panels (Toast, Square, Clover, Lightspeed) are a useful secondary signal for tech-stack targeting, but they are biased toward the customer bases of those specific vendors. Use them to enrich a universe, not to define one.
§ 04 — The Build: The six steps, end to end
Define the universe in SoS data
Start with active legal entities filed under NAICS 722513 (limited-service restaurants), 722515 (snack and beverage bars), and where relevant 722511 (full-service, which catches fast casual brands depending on state classification). Filter to current_status = active and inactive = false. The result is roughly 280,000–320,000 entities nationally — a higher number than the 200,000-location count, because multi-unit operators file separate LLCs per location or per small cluster.
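A minimal sketch of that filter, assuming the SoS extract is already loaded as a list of dicts. `current_status` and `inactive` are the fields named above; `naics_code` is a hypothetical field name, and the sample rows are made up.

```python
# NAICS codes from the step above; 722511 is optional and catches fast casual in some states.
QSR_NAICS = {"722513", "722515", "722511"}

def in_universe(entity: dict) -> bool:
    """Keep active entities filed under one of the target NAICS codes."""
    return (
        str(entity.get("naics_code", "")) in QSR_NAICS  # hypothetical field name
        and str(entity.get("current_status", "")).lower() == "active"
        and entity.get("inactive") is False
    )

# Made-up rows for illustration only.
entities = [
    {"company_number": "A1", "naics_code": "722513", "current_status": "Active", "inactive": False},
    {"company_number": "A2", "naics_code": "722513", "current_status": "Dissolved", "inactive": True},
    {"company_number": "A3", "naics_code": "541110", "current_status": "Active", "inactive": False},
]
universe = [e for e in entities if in_universe(e)]
```

Only the first row survives: the second is inactive and the third is filed under a non-restaurant code.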
Resolve the brand for each entity
Join the alternative names file on company_number + jurisdiction_code. The DBA filings are where "Crawford Holdings #4 LLC" becomes "Subway #38291." Pattern-match the trade names against a brand reference list. Every entity now lands in one of three buckets: brand-franchised, brand-corporate (registered as a branch of the franchisor), or independent.
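The join and the three-bucket split could look roughly like this. The brand patterns are a hypothetical two-entry reference list, and treating `branch == "F"` as the corporate-branch marker is an assumption about the schema, not a confirmed rule.

```python
import re

# Hypothetical brand reference list: brand name -> pattern matched against trade names.
BRAND_PATTERNS = {
    "Subway": re.compile(r"\bsubway\b", re.IGNORECASE),
    "McDonald's": re.compile(r"\bmcdonald'?s\b", re.IGNORECASE),
}

def resolve_brand(entity: dict, alt_names: dict) -> tuple:
    """Return (bucket, brand) for one entity.

    alt_names maps (company_number, jurisdiction_code) -> list of trade names.
    """
    key = (entity["company_number"], entity["jurisdiction_code"])
    for trade_name in alt_names.get(key, []):
        for brand, pattern in BRAND_PATTERNS.items():
            if pattern.search(trade_name):
                # Assumption: branch == "F" marks a branch of the franchisor.
                is_branch = entity.get("branch") == "F"
                return ("brand-corporate" if is_branch else "brand-franchised"), brand
    return "independent", None
```

An entity with no matching trade name falls through to `independent`, so the quality of the brand reference list directly controls how many franchisees get mislabeled.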
Roll up to the operator level
The step that turns a list of locations into a list of accounts. Link multiple LLCs owned by the same operator using person_uid on the officer file, common officer name + address when UIDs are unavailable, and control statements in the corporate relationships file. The output: an operator-level account list where each account has 1 to N locations attached.
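One way to sketch the roll-up is a union-find over officer rows: every LLC is linked to a person key, and LLCs that share a person collapse into one account. This version covers the `person_uid` link and the name-plus-address fallback only; control statements are left out, and the `name` and `address` field names are assumptions.

```python
from collections import defaultdict

def roll_up(officers: list) -> list:
    """Group company numbers into operator accounts via shared officers."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for row in officers:
        if row.get("person_uid"):
            person = ("uid", row["person_uid"])
        else:
            # Fallback when no UID exists: normalized officer name + address.
            person = ("name_addr", row["name"].strip().lower(), row["address"].strip().lower())
        union(("company", row["company_number"]), person)

    accounts = defaultdict(set)
    for node in list(parent):
        if node[0] == "company":
            accounts[find(node)].add(node[1])
    return sorted(sorted(group) for group in accounts.values())
```

The name-plus-address fallback is the weak link. A registered-agent address shared by unrelated operators would merge them, so it is worth excluding known agent addresses before running this on real data.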
Size with transaction data
Match credit-card transaction panel estimates to each location's registered_address. Sum per-location estimates by operator. Now every account has both a location count and an estimated revenue. Two 50-location operators in different brands can be similarly sized; a 50-location independent cluster is structurally a different conversation.
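The sizing step is a grouped sum. A minimal sketch, assuming the panel has been reduced to a dict keyed by normalized address and that each location row already carries a hypothetical `operator_id` from the roll-up:

```python
from collections import defaultdict

def size_operators(locations: list, panel: dict) -> dict:
    """Sum per-location revenue estimates by operator.

    locations: rows with operator_id (hypothetical) and registered_address.
    panel: normalized address -> estimated annual revenue (an estimate, not an exact figure).
    """
    totals = defaultdict(lambda: {"locations": 0, "est_revenue": 0.0, "unmatched": 0})
    for loc in locations:
        acct = totals[loc["operator_id"]]
        acct["locations"] += 1
        # Crude normalization: lowercase and collapse whitespace.
        key = " ".join(loc["registered_address"].lower().split())
        if key in panel:
            acct["est_revenue"] += panel[key]
        else:
            acct["unmatched"] += 1  # keep the gap visible instead of treating it as zero revenue
    return dict(totals)
```

Tracking `unmatched` separately matters: an operator with 50 locations and 30 unmatched addresses is under-sized, not small.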
Verify operating status
Match locations against state and county health department permit files where available. An active permit confirms the location is open. An expired or revoked permit means the storefront is dark even if the LLC is still active in SoS — which happens constantly, especially after lease terminations and brand churn.
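As a small illustration of the status rule, assuming each matched permit record exposes hypothetical `status` and `expires` fields (real permit files vary by jurisdiction):

```python
from datetime import date

def operating_status(permit, today: date) -> str:
    """Classify a location from its health permit record (None if no file matched)."""
    if permit is None:
        return "unknown"  # no permit file available, so do not assume open or dark
    if permit["status"] == "revoked" or permit["expires"] < today:
        return "dark"  # storefront closed even if the LLC is still active in SoS
    return "open"
```

Keeping `unknown` distinct from `dark` avoids dropping live locations in counties that simply do not publish permit data.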
Add operator contact data
Join officer_phone and officer_email_hem on the principal officer, typically the Managing Member whose start_date matches the entity's incorporation_date. For multi-unit operators, the operator's contact info is usually the same person across all their LLCs — so you contact the human, not each LLC.
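A sketch of the principal-officer pick. `start_date`, `incorporation_date`, and `officer_email_hem` come from the step above; the `position` field and its "Managing Member" value are assumptions about how the officer file labels roles.

```python
def principal_officer(entity: dict, officers: list):
    """Pick the Managing Member whose start_date matches the entity's incorporation_date."""
    for officer in officers:
        if (
            officer["company_number"] == entity["company_number"]
            and officer.get("position") == "Managing Member"  # hypothetical field and value
            and officer.get("start_date") == entity["incorporation_date"]
        ):
            return officer
    return None  # no clean match: leave for manual review rather than guess
```

Because the same person usually appears across an operator's LLCs, deduplicating the returned officers per operator account gives one contact per human rather than one per entity.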
§ 05 — The Web Layer: The part that costs almost nothing
Steps 1–6 give you the entity spine. But for outbound, you usually also want the public-facing artifacts — the restaurant's website, the team page, the manager's direct email, the booking link. That is the part most vendors charge per-domain for.
It shouldn't cost what they charge. Most of it is sitting in raw HTML.
Classify before you extract
Before pulling anything off a website, fetch the homepage and sort it into one of five buckets based on what's actually in the response body — not the status code. A Cloudflare challenge and an empty React shell both return a cheerful 200.
- Email already in the raw HTML. Best case — free.
- Plain server-rendered text. Free with deterministic extraction.
- Empty JavaScript shell. Needs a headless render — paid.
- Cloudflare or bot-wall. Needs a paid bypass or archived copy.
- Dead domain. Skip.
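The five buckets can be sketched as a single function over the response body. The bot-wall markers and the 200-character threshold for an empty shell are assumptions to tune against a real sample, not established constants.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Strip script/style blocks first, then any remaining tags.
TAG_RE = re.compile(r"<(script|style)\b.*?</\1>|<[^>]+>", re.IGNORECASE | re.DOTALL)
# Assumed markers of a challenge page; tune these against real responses.
BOT_WALL_MARKERS = ("just a moment...", "cf-chl", "attention required")

def classify(body):
    """Sort a fetched homepage into one of five buckets using the body, not the status code."""
    if body is None:
        return "dead"  # fetch failed: DNS error, timeout, refused connection
    lowered = body.lower()
    if any(marker in lowered for marker in BOT_WALL_MARKERS):
        return "bot_wall"
    if EMAIL_RE.search(body) or "data-cfemail" in lowered:
        return "email_in_html"
    visible = " ".join(TAG_RE.sub(" ", body).split())
    if len(visible) < 200:  # assumed threshold for "empty JavaScript shell"
        return "js_shell"
    return "server_rendered"
```

The bot-wall check runs first on purpose, since a challenge page can carry a support address that would otherwise land it in the best-case bucket.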
On an independent restaurant sample, the free buckets typically cover 60–70% of sites. Restaurants skew toward simpler websites than tech companies; many run on Squarespace, Wix, or template builders that serve clean static HTML. The free share for restaurants is actually higher than it is for, say, law firms or B2B SaaS.
The classification step tells you — on a cheap 500-row sample — what the expensive run will cost on the full market, before you commit a dollar to it. The route split is the cost model.
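As a worked example of the route split acting as a cost model: scale each bucket's share of the sample to the full market and multiply by a per-page price. The sample counts and unit prices below are made-up placeholders, not vendor quotes.

```python
def projected_cost(sample_counts: dict, market_size: int, unit_cost: dict) -> float:
    """Scale a sample's route split to the full market and price each route."""
    sample_total = sum(sample_counts.values())
    return sum(
        (count / sample_total) * market_size * unit_cost.get(route, 0.0)
        for route, count in sample_counts.items()
    )

# Hypothetical 500-row sample and hypothetical per-page prices in dollars.
sample = {"email_in_html": 150, "server_rendered": 180, "js_shell": 60, "bot_wall": 85, "dead": 25}
prices = {"js_shell": 0.002, "bot_wall": 0.01}  # free routes default to 0.0
estimate = projected_cost(sample, market_size=100_000, unit_cost=prices)
```

With these placeholder numbers the paid routes are 12% and 17% of the sample, so the projection is roughly 12,000 renders plus 17,000 bypasses, or about $194 for a 100,000-site market.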
What deterministic code can pull for free
On a free-bucket site, regex and parsers handle most of the work:
- mailto: links — straight regex.
- Cloudflare-obfuscated emails — hex-encoded with a one-byte XOR key, fully reversible without running their JavaScript.
- Schema.org / JSON-LD blocks — author-declared structured markup that hands you name, title, phone, hours, cuisine, and address in one shot. Restaurants use this heavily for SEO. A goldmine.
- Team and staff cards — the photo-name-title block on "Meet the Owner" pages, paired by DOM proximity.
None of this needs an LLM. The majority of restaurant homepages don't need to touch a model at all.
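A sketch of the first three extractors using only the standard library. It assumes the obfuscated address sits in a `data-cfemail` attribute, where the first byte is the XOR key and each remaining byte is one character XORed with it. Team-card pairing by DOM proximity needs a real HTML parser and is not shown.

```python
import json
import re

MAILTO_RE = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)
CFEMAIL_RE = re.compile(r'data-cfemail="((?:[0-9a-fA-F]{2})+)"')
JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)

def decode_cfemail(hex_string: str) -> str:
    """First byte is the XOR key; every following byte is one character XORed with it."""
    raw = bytes.fromhex(hex_string)
    return bytes(b ^ raw[0] for b in raw[1:]).decode("utf-8", errors="replace")

def extract(html: str) -> dict:
    """Pull emails and JSON-LD blocks out of raw HTML without rendering it."""
    emails = set(MAILTO_RE.findall(html))
    emails.update(decode_cfemail(h) for h in CFEMAIL_RE.findall(html))
    blocks = []
    for payload in JSONLD_RE.findall(html):
        try:
            blocks.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # malformed markup is common; skip rather than guess
    return {"emails": sorted(emails), "jsonld": blocks}
```

Regex over HTML is brittle by nature. It is good enough for a classification pass on a sample, but attribute order and quoting vary in the wild, so expect to extend the patterns once real pages break them.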
When to send a page to a model
Reserve the LLM for the genuinely ambiguous pages — six staff members, four emails, no clean structural pairing. Run the page (which you already have on disk from the classification pass — don't re-crawl) through a cheap model like Gemini Flash-Lite, scoped only to the fields that didn't resolve deterministically.
For restaurants, the named-contact lift from this re-read is significant. Independent restaurants often hide owners behind info@ mailboxes, and the model pass can pull the owner's actual name off the About page and attach it to the inbox — without re-fetching anything.
The grounding gate
A model that pairs names to emails will fabricate. Given a first.last@ address, it will reverse-engineer "First Last" as the owner — even if that person appears nowhere on the page. That is a fabrication wearing a confidence score. Skip this step and you will burn your sending reputation in a week.
Two gates before you trust any model-attached name:
- Deterministic floor. Both the first and last word of the name must literally appear in the page text. Reject otherwise. This single rule catches most reverse-engineered names.
- Adversarial second pass. A separate model gets the page and the proposed pairing with one job: prove this is the wrong person. Default to "wrong" if uncertain.
Only pairings that survive both gates ship. With the gates, named-to-email precision runs near 100%. Without them, you ship fabrications.
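The deterministic floor is a few lines of code. One detail in this sketch is an assumption beyond the rule as stated: email addresses are stripped from the page text before matching, because otherwise `first.last@` would itself supply both words and the gate would pass exactly the fabrication it exists to catch.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")  # letters, with inner ' or -

def passes_floor(name: str, page_text: str) -> bool:
    """Both the first and last word of the name must literally appear in the page text."""
    name_words = WORD_RE.findall(name.lower())
    if len(name_words) < 2:
        return False  # a single token is not enough to ground a person
    # Remove addresses so the email itself cannot vouch for the name.
    text_without_emails = EMAIL_RE.sub(" ", page_text.lower())
    page_words = set(WORD_RE.findall(text_without_emails))
    return name_words[0] in page_words and name_words[-1] in page_words
```

`page_text` is assumed to be the visible text of the page, not raw HTML, so that words hidden in attributes or scripts do not count as evidence. The adversarial second pass is a model call and is not sketched here.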
What restaurant pages give you that other verticals don't
- Reservation / ordering platform in the source is a clean technographic signal (OpenTable, Resy, Toast, Square, ChowNow) — pulled deterministically from script tags.
- Cuisine, price range, hours, and address are almost always in schema markup.
- Owner stories on About pages are unusually rich. In restaurants the owner is the brand, and they say so on the page.
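The platform signal can be pulled with a hostname lookup over embedded URLs. The hostname fragments below are assumptions to verify against real pages, and this sketch also scans iframe and link targets, which goes slightly beyond script tags.

```python
import re

# Assumed hostname fragments per platform; verify each against real pages.
PLATFORM_HOSTS = {
    "OpenTable": "opentable.com",
    "Resy": "resy.com",
    "Toast": "toasttab.com",
    "Square": "squareup.com",
    "ChowNow": "chownow.com",
}
SRC_RE = re.compile(r'<(?:script|iframe|a)\b[^>]*?(?:src|href)=["\']([^"\']+)', re.IGNORECASE)

def detect_platforms(html: str) -> list:
    """Return the platforms whose hostnames appear in script, iframe, or link URLs."""
    urls = [u.lower() for u in SRC_RE.findall(html)]
    return sorted(name for name, host in PLATFORM_HOSTS.items() if any(host in u for u in urls))
```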
§ 06 — The Residue: Where you actually pay
After all of the above, what remains:
- Cloudflare-walled sites (~15–20% of restaurants)
- Sites with no named owner anywhere in the HTML
- Locations where you have an entity but no website at all
- Operator personal contact info beyond what is in SoS officer records
This is the residue — the only slice where paid tools earn their keep. Contact APIs, render services, archived Common Crawl copies, consumer-attribute overlays.
Build the paid tier as a separate layer with a hard dollar cap per run. The only safe way to escalate paid lookups across 100,000 restaurants is to make overspend structurally impossible. Test the billing logic on a small live sample first. Two bugs only show up the instant real money touches real domains: failed API calls counted as money spent, and endpoints you aren't actually subscribed to silently returning 401s.
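A minimal sketch of a cap that makes overspend structurally impossible: every paid call goes through one gate that refuses to run once the budget would be exceeded. Amounts are integer cents to avoid float drift. It assumes the provider bills only successful calls and that calls run one at a time; both need checking against your actual vendor and concurrency model.

```python
class BudgetExceeded(Exception):
    """Raised instead of making a call that would break the cap."""

class CappedBudget:
    """Refuses any paid call that would push a run past its hard cap."""

    def __init__(self, cap_cents: int):
        self.cap_cents = cap_cents
        self.spent_cents = 0

    def charge(self, call, cost_cents: int):
        if self.spent_cents + cost_cents > self.cap_cents:
            raise BudgetExceeded(f"cap of {self.cap_cents} cents would be exceeded")
        result = call()  # if this raises (e.g. a 401), nothing is recorded as spent
        self.spent_cents += cost_cents
        return result
```

The check happens before the call, so the worst case is a refused lookup, never an overrun. A failed call propagates its exception and leaves `spent_cents` unchanged, which is the opposite of the failed-calls-counted-as-spend bug.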
§ 07 — The Overlay: Operator-level consumer attributes
Because the QSR operator is typically the buyer, and the operator's personal profile correlates with their business, consumer attributes overlaid on the operator's contact address are useful for prioritization within revenue bands:
- Vehicle ownership. Operators with multiple high-MSRP vehicles (Auto_MSRP_Max > $50,000, Auto_Class_Luxury) skew toward higher-revenue multi-unit operators. Single mainstream vehicles skew toward single-unit owner-operators.
- Home value and real estate. Operator home value correlates strongly with operator business revenue in QSR.
- Household composition. Family-owned operations show different buying patterns than single-operator businesses.
These signals don't replace transaction-panel data for sizing. They refine prioritization once sizing is done.
§ 08 — The Lessons: What real data will break
Some specifics worth bracing for. Every one of these gets caught on a 500-row sample for pennies — if you run it on real restaurant websites instead of fixtures.
- Sentry error-tracking strings that match email regex.
- Words that decode as emails. "cre-AT-ive" gives you an @ if you're not careful.
- Section headers grabbed as names. "Areas of Practice," "Catering Inquiries," and "Beginning Spring 2026" have all been read as a person's name.
- DNS resolver throttling. Batch-check 50,000 domains from one IP and half the market will look dead until you slow down or distribute the lookups.
- State NAICS classification drift. Fast casual chains land under 722511 in some states and 722513 in others. Pull both.
- Branch-vs-entity confusion. Corporate-owned locations register as branches of the franchisor, not independent entities. Treat them as accounts and you double-count.
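The first item on that list can be handled with a post-filter on regex hits. This sketch assumes the false positives come from Sentry DSN strings, where a 32-character hex key sits before the `@` and the host ends in `sentry.io`. Treat both rules as heuristics to confirm on your own sample.

```python
import re

HEX_KEY_RE = re.compile(r"^[0-9a-f]{32}$")

def looks_like_real_email(candidate: str) -> bool:
    """Reject regex hits that are really error-tracking keys, not mailboxes."""
    local, _, domain = candidate.lower().rpartition("@")
    if not local or not domain:
        return False
    if HEX_KEY_RE.match(local) or domain.endswith("sentry.io"):
        return False  # DSN key plus ingest host, not a person
    return True
```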
Lazy first. Crazy later. Read the free pages with free code, send only the hard ones to a model, and pay only for what is actually left.
§ 09 — Frequent Questions: What practitioners ask
Corporate-owned locations register as branches (branch = 'F') of the franchisor's corporate entity, with home_jurisdiction_company_number pointing back to corporate. Franchisee-owned locations register as independent LLCs with the operator as the officer.



