entity_resolve — Watt Data Docs

Resolve entity identities by matching emails, phones, addresses, MAIDs, websites, or social handles. Supports multi-criterion queries with Noisy-OR quality score aggregation. Returns matches grouped by entity ID, each with per-criterion and overall quality scores.

Quick Example

{
  "entity_type": "person",
  "identifiers": [
    {
      "id_type": "email",
      "hash_type": "plaintext",
      "values": ["alice@example.com", "bob@example.com"]
    }
  ]
}

Input Parameters

Parameter	Type	Required	Default	Constraints	Description
entity_type	string	Yes	-	"person" or "business"	Type of entity to resolve
identifiers	array	Conditional	-	Max 50 groups per request; each group's `values` array is capped at 3,000 entries — use `csv_resource_uri` for larger inputs	Multi-criterion identifiers. Mutually exclusive with csv_resource_uri
csv_resource_uri	string	Conditional	-	workflow:// URI	CSV file with identifiers. Mutually exclusive with identifiers
lookup_columns	object	Conditional	-	See below	Column-mapping for CSV-based resolution. Required when `csv_resource_uri` is set; at least one sub-key (`email`, `phone`, `address`, `name`, `linkedin`, `domain`) must have non-empty `names`
offset	number	No	0	Integer ≥ 0	Number of CSV data rows to skip before reading. Use with `limit` to paginate large CSVs across multiple calls. Only applies to `csv_resource_uri`; ignored when `identifiers` is used
limit	number	No	200000	1 ≤ limit ≤ 200000	Maximum number of CSV data rows to read in this call. Only applies to `csv_resource_uri`; ignored when `identifiers` is used
format	string	No	"none"	"none", "csv", "json", "jsonl"	Export format - generates presigned S3 URL valid for 1 hour
identifier_types	array	No	person → ["email"], business → ["name"]	person: "name", "email", "phone", "address", "maid", "social:linkedin" — business: "name", "phone", "address", "social:linkedin", "website"	Contact types to return in identifiers field (allowed values depend on entity_type)
workflow_id	string	No	-	Valid UUID	Workflow session identifier for correlation

Parameter Details:

entity_type:

Required. Use "person" for individual identities or "business" for company entities.

identifiers:

Array of objects, each specifying id_type, hash_type, and values[]
Allows querying across different identifier types in one call
Email/phone/maid can be mixed in a single call
Address identifiers can also be included alongside other types
Returns Noisy-OR aggregated overall_quality_score per entity
Capped at 50 identifier groups per request — split larger inputs into multiple calls
Each identifier group's values array is capped at 3,000 entries; for larger inputs use csv_resource_uri, which is read in pages of up to 200,000 rows per call
Mutually exclusive with csv_resource_uri
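
The two caps above can be respected client-side by slicing values before sending. The following is a minimal sketch, not part of the tool itself: `batchIdentifiers` and `chunk` are hypothetical helper names, and it assumes that several groups with the same id_type are accepted in one request, which this page does not state. For very large inputs, csv_resource_uri remains the documented route.

```typescript
// Limits taken from the parameter table above.
const MAX_GROUPS_PER_REQUEST = 50;
const MAX_VALUES_PER_GROUP = 3000;

type HashType = "plaintext" | "md5" | "sha1" | "sha256";

interface IdentifierGroup {
  id_type: string;
  hash_type: HashType;
  values: string[];
}

// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Hypothetical helper: each inner array is the `identifiers` payload
// for one entity_resolve call.
function batchIdentifiers(
  idType: string,
  hashType: HashType,
  values: string[],
): IdentifierGroup[][] {
  const groups = chunk(values, MAX_VALUES_PER_GROUP).map(
    (slice): IdentifierGroup => ({
      id_type: idType,
      hash_type: hashType,
      values: slice,
    }),
  );
  return chunk(groups, MAX_GROUPS_PER_REQUEST);
}
```

With 7,000 values this yields a single request containing three groups of 3,000, 3,000, and 1,000 values.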

csv_resource_uri:

Workflow resource URI pointing to a CSV file (e.g., workflow://{workflow_id}/uploads/customers.csv)
Requires lookup_columns with at least one identifier type populated
The CSV is processed in pages of up to 200,000 rows per call. When more rows remain, the response includes a next_offset field — pass it back as offset on the next call. The field is omitted on the last page.
Mutually exclusive with identifiers
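
The paging rule above (pass next_offset back as offset until the field is omitted) can be sketched as a small pure function. The type names and `nextPageParams` are illustrative assumptions, and the shapes are a subset of the request schema, not the full contract.

```typescript
// Subset of the CSV-path request parameters.
interface CsvResolveParams {
  entity_type: "person" | "business";
  csv_resource_uri: string;
  lookup_columns: Record<string, { names: string[]; hash_type?: string }>;
  offset?: number;
  limit?: number;
}

// Only the paging field of the response matters here.
interface CsvPageResult {
  next_offset?: number; // omitted on the last page
}

// Build the params for the following page, or null when the page just
// received was the last one. The input params are not mutated.
function nextPageParams(
  params: CsvResolveParams,
  page: CsvPageResult,
): CsvResolveParams | null {
  if (page.next_offset === undefined) return null;
  return { ...params, offset: page.next_offset };
}
```

A driver loop would call the tool, collect the page, and repeat with the returned params until `nextPageParams` yields null.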

lookup_columns (CSV mode):

Maps CSV columns to identifier types. The same shape is used by resolve_and_enrich_rows — see Conventions → CSV Column Mapping for the canonical reference, including per-identifier rules, multi-column address joining, and the address_parse_low_yield warning.

{
  email?:    { names: string[]; hash_type?: "plaintext" | "md5" | "sha1" | "sha256" },
  phone?:    { names: string[]; hash_type?: "plaintext" | "md5" | "sha1" | "sha256" },
  address?:  { names: string[] },
  name?:     { names: string[] },
  linkedin?: { names: string[] },   // resolves via the `social:linkedin` identifier type
  domain?:   { names: string[] }    // business entities — resolves via the `website` identifier type
}

At least one sub-key with non-empty names is required. Only email and phone accept hash_type; the other types require plaintext values. When address.names lists more than one column, per-row cell values are concatenated in listed order with ", " before libpostal parsing — list them street-first (address1, address2?, city?, region?, postcode, country?).
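
A client-side check of these two rules might look like the sketch below. `checkLookupColumns` is a hypothetical helper, its messages are illustrative rather than the service's actual error strings, and `ColumnSpec` deliberately allows hash_type on every key so that loosely typed input can be checked.

```typescript
type HashType = "plaintext" | "md5" | "sha1" | "sha256";
type LookupKey = "email" | "phone" | "address" | "name" | "linkedin" | "domain";

interface ColumnSpec {
  names: string[];
  hash_type?: HashType;
}

type LookupColumns = Partial<Record<LookupKey, ColumnSpec>>;

// Only these sub-keys accept a non-plaintext hash_type.
const HASHABLE_KEYS: string[] = ["email", "phone"];

// Returns a list of problems; an empty list means the mapping looks valid.
function checkLookupColumns(cols: LookupColumns): string[] {
  const problems: string[] = [];
  let populated = 0;
  for (const key of Object.keys(cols) as LookupKey[]) {
    const spec = cols[key];
    if (!spec) continue;
    if (spec.names.length > 0) populated += 1;
    const hash = spec.hash_type;
    const hashed = hash !== undefined && hash !== "plaintext";
    if (hashed && HASHABLE_KEYS.indexOf(key) === -1) {
      problems.push(`lookup_columns.${key} requires plaintext values`);
    }
  }
  if (populated === 0) {
    problems.push("lookup_columns needs at least one sub-key with non-empty names");
  }
  return problems;
}
```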

Migration from legacy *_columns keys: The flat email_columns, phone_columns, and address_columns parameters from earlier V2 betas are rejected with a per-key error naming the lookup_columns.<key> replacement. Update existing callers to the nested shape.

Supported id_types:

Person entities:

"email" - Email addresses with automatic normalization
"phone" - Phone numbers (E.164 format recommended)
"address" - Physical addresses (libpostal-parsed component matching with apartment/unit resolution)
"maid" - Mobile advertising IDs (IDFA for iOS, GAID for Android)
"name" - Person names (first/last/full)
"social:linkedin" - LinkedIn profile, passed as either a bare slug (e.g. john-doe-070215) or full URL (e.g. https://www.linkedin.com/in/john-doe-070215/). Scheme, www., trailing slashes, and path suffixes like /details/experience are stripped automatically.

Business entities:

"name" - Company names
"phone" - Business phone numbers
"address" - Business addresses (same parsing as person addresses)
"website" - Company website or domain (e.g. https://example.com or example.com)
"social:linkedin" - LinkedIn company page, passed as either a bare slug (e.g. tennis-en-padel-shop-noord) or full URL (e.g. https://www.linkedin.com/company/tennis-en-padel-shop-noord/). Scheme, www., trailing slashes, and /about-style path suffixes are stripped automatically. Additional networks (social:<network>) may be added in the future.

Supported hash_types:

"plaintext" - Unhashed values
"md5" - MD5 hash
"sha1" - SHA-1 hash
"sha256" - SHA-256 hash

Example identifiers:

{
  "identifiers": [
    {
      "id_type": "email",
      "hash_type": "plaintext",
      "values": ["alice@example.com", "bob@example.com"]
    },
    {
      "id_type": "phone",
      "hash_type": "plaintext",
      "values": ["+15551234567"]
    }
  ]
}

format:

When set to csv, json, or jsonl, generates a presigned S3 download URL
URL expires in 1 hour
Returns export metadata in response

identifier_types:

Array of contact types to return in the identifiers field
Allowed values depend on entity_type:
- "person" → "name", "email", "phone", "address", "maid", "social:linkedin"
- "business" → "name", "phone", "address", "social:linkedin", "website"
Defaults: person → ["email"], business → ["name"]
Values outside the set for the chosen entity_type are rejected
Returns actual stored contact data from the resolved entity profiles
Eliminates the need for a follow-up entity_enrich call to retrieve contact info
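
The allowed sets and defaults above can be mirrored on the client to catch rejected values before a call is made. This is a sketch with hypothetical helper names; how the service treats an empty array is not documented here, so only an omitted parameter falls back to the default.

```typescript
type EntityType = "person" | "business";

// Allowed values as listed above.
const ALLOWED_TYPES: Record<EntityType, string[]> = {
  person: ["name", "email", "phone", "address", "maid", "social:linkedin"],
  business: ["name", "phone", "address", "social:linkedin", "website"],
};

// Documented defaults when identifier_types is omitted.
const DEFAULT_TYPES: Record<EntityType, string[]> = {
  person: ["email"],
  business: ["name"],
};

// Returns the requested types that fall outside the set for entityType.
function invalidIdentifierTypes(
  entityType: EntityType,
  requested: string[],
): string[] {
  const allowed = ALLOWED_TYPES[entityType];
  return requested.filter((t) => allowed.indexOf(t) === -1);
}

// Applies the documented default when the parameter is omitted.
function effectiveIdentifierTypes(
  entityType: EntityType,
  requested?: string[],
): string[] {
  return requested ?? DEFAULT_TYPES[entityType];
}
```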

workflow_id:

Optional UUID for tracking related tool calls in a session
If not provided, a new workflow_id is generated
Used for deterministic sampling and feedback correlation

Request Schema:

interface EntityResolveParams {
  entity_type: "person" | "business";
  identifiers?: Array<{
    // person: "name" | "email" | "phone" | "address" | "maid" | "social:linkedin"
    // business: "name" | "phone" | "address" | "website" | "social:linkedin"
    id_type: string;
    hash_type: "plaintext" | "md5" | "sha1" | "sha256";
    values: string[];
  }>;
  csv_resource_uri?: string;
  lookup_columns?: {
    email?: { names: string[]; hash_type?: "plaintext" | "md5" | "sha1" | "sha256" };
    phone?: { names: string[]; hash_type?: "plaintext" | "md5" | "sha1" | "sha256" };
    address?: { names: string[] };
    name?: { names: string[] };
    linkedin?: { names: string[] };
    domain?: { names: string[] };
  };
  offset?: number;
  limit?: number;
  format?: "none" | "csv" | "json" | "jsonl";
  // person: Array<"name" | "email" | "phone" | "address" | "maid" | "social:linkedin">
  // business: Array<"name" | "phone" | "address" | "social:linkedin" | "website">
  identifier_types?: string[];
  workflow_id?: string;
}

Output Format

Success Response:

{
  entities: Array<{
    entity_id: number;
    overall_quality_score: number;
    matches: Array<{
      criterion_type: string;
      criterion_value: string;
      quality_score: number;
    }>;
    identifiers: {
      [type: string]: string[];
    };
    address?: {
      normalized_key: string;
      // latitude, longitude, distance_meters are not returned in V2 responses
    };
  }>;
  stats: {
    requested: number;
    resolved: number;
    rate: number;
    resolved_by_type: Record<string, number>;
  };
  export?: {
    url: string;
    format: "csv" | "json" | "jsonl";
    rows: number;
    size_bytes: number;
    expires_at: string;
    resource_uri: string;
  };
  warnings?: Array<{ code: string; message: string }>;
  tool_trace_id: string;
  workflow_id: string;
}

Response Fields:

Field	Type	Description
entities	array	Array of resolved entities grouped by entity_id
entities[].entity_id	number	Entity ID
entities[].overall_quality_score	number	Noisy-OR aggregated confidence (0-1) across all matches
entities[].matches	array	Individual criterion matches with per-criterion scores
entities[].matches[].criterion_type	string	Type (e.g., "email_plaintext", "phone_md5")
entities[].matches[].criterion_value	string	The matched value
entities[].matches[].quality_score	number	Quality score for this specific match (0-1)
entities[].identifiers	object	Stored contact data, keyed by type
entities[].address	object	Address match data (only for address queries). Contains `normalized_key`; geo coordinates are not returned in V2 responses.
stats.requested	number	Total identifier values provided across all groups
stats.resolved	number	Distinct entities matched
stats.rate	number	Distinct entities resolved per identifier value requested, computed as `resolved / requested` and bounded to `[0, 1]`
stats.resolved_by_type	object	Distinct entities matched per identifier type (e.g. `{"email": 171, "address": 226}`). Each entity contributes at most 1 per type bucket regardless of how many criteria of that type matched it.
export	object	Export metadata (only when format is csv/json/jsonl)
export.url	string	Presigned S3 download URL (expires in 1 hour)
export.resource_uri	string	Workflow resource URI for the exported file
warnings	array	Optional. Non-fatal warnings raised during the run. CSV-mode resolution emits `address_parse_low_yield` when most address values failed libpostal parsing — typically a sign that the columns under `lookup_columns.address.names` were listed in the wrong order, or that a single mapped column contains only fragments without a postcode
tool_trace_id	string	OpenTelemetry trace ID for this tool execution
workflow_id	string	Workflow session identifier
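
For intuition about overall_quality_score, the textbook Noisy-OR combination is shown below: the combined score is the probability that at least one of several independent matches is correct. This sketch assumes the unweighted standard form; the service may apply weights or adjustments that this page does not describe.

```typescript
// Noisy-OR over per-criterion scores: 1 - product of (1 - p_i).
function noisyOr(scores: number[]): number {
  let allWrong = 1; // probability that every match is wrong
  for (const p of scores) {
    if (p < 0 || p > 1) {
      throw new RangeError("scores must lie in [0, 1]");
    }
    allWrong *= 1 - p;
  }
  return 1 - allWrong;
}
```

Under this form, a single match keeps its own score, and two matches scored 0.8 and 0.5 combine to 1 - (0.2 × 0.5) = 0.9, so additional corroborating criteria can only raise the overall score.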

Example Response (Email Resolution):

{
  "entities": [
    {
      "entity_id": 123456,
      "overall_quality_score": 0.95,
      "matches": [
        {
          "criterion_type": "email_plaintext",
          "criterion_value": "john.doe@example.com",
          "quality_score": 0.95
        }
      ],
      "identifiers": {
        "email": ["john.doe@example.com", "jdoe@work.com"]
      }
    }
  ],
  "stats": {
    "requested": 2,
    "resolved": 1,
    "rate": 0.5,
    "resolved_by_type": { "email": 1 }
  },
  "tool_trace_id": "a1b2c3d4e5f6",
  "workflow_id": "550e8400-e29b-41d4-a716-446655440000"
}

Example Response (Address Resolution with Key-Based Matching):

{
  "entities": [
    {
      "entity_id": 789012,
      "overall_quality_score": 0.88,
      "matches": [
        {
          "criterion_type": "address_plaintext",
          "criterion_value": "123 Main St, San Francisco, CA 94105",
          "quality_score": 0.88
        }
      ],
      "identifiers": {
        "email": ["resident@example.com"]
      },
      "address": {
        "normalized_key": "123 main st san francisco ca 94105 usa"
      }
    }
  ],
  "stats": {
    "requested": 1,
    "resolved": 1,
    "rate": 1.0,
    "resolved_by_type": { "address": 1 }
  },
  "tool_trace_id": "a1b2c3d4e5f6",
  "workflow_id": "550e8400-e29b-41d4-a716-446655440000"
}

Error Handling

Common Errors:

Both identifiers and csv_resource_uri provided: "identifiers and csv_resource_uri are mutually exclusive. Provide one or the other."
Neither provided: "Either identifiers or csv_resource_uri must be provided."
csv_resource_uri without column mappings: "When using csv_resource_uri, lookup_columns must specify at least one identifier type (email, phone, address, name, linkedin, or domain) with non-empty names."
Address identifier with non-plaintext hash_type: "Address identifiers require hash_type 'plaintext' — address parsing cannot use hashed values"
Social identifier (social:linkedin, etc.) with non-plaintext hash_type: "Social identifiers require hash_type 'plaintext' — slug normalization cannot use hashed values"
Identifier type not valid for the chosen entity_type (e.g., maid for a business): "Identifier types not allowed for entity_type='business'. Allowed: name, phone, address, social:linkedin, website. Violations: identifiers[0].id_type='maid'."
More than 50 identifier groups in identifiers: "Maximum 50 identifier groups allowed."
More than 3,000 values in any identifier group: "Maximum 3000 values per identifier group."
Service temporarily unavailable: "Failed to resolve entities. Please try again or contact support if the issue persists." Carries a structured details payload with cause, hint, and workflow_id so on-call can correlate with ClickStack — see the RESOLVE_ERROR_CAUSES enum in lib/util/classifyResolveError.ts for the bounded cause set.
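
Several of the errors above can be caught before the request is sent. The sketch below is a client-side pre-flight check under stated assumptions: `preflightErrors` is a hypothetical helper, the first four messages are copied from the list above, and the plaintext message is abbreviated rather than the service's exact wording.

```typescript
interface IdentifierGroup {
  id_type: string;
  hash_type: string;
  values: string[];
}

interface PreflightInput {
  identifiers?: IdentifierGroup[];
  csv_resource_uri?: string;
}

// Returns the errors a request would be expected to trigger; empty means none found.
function preflightErrors(input: PreflightInput): string[] {
  const errors: string[] = [];
  const hasIds = input.identifiers !== undefined;
  const hasCsv = input.csv_resource_uri !== undefined;
  if (hasIds && hasCsv) {
    errors.push("identifiers and csv_resource_uri are mutually exclusive. Provide one or the other.");
  }
  if (!hasIds && !hasCsv) {
    errors.push("Either identifiers or csv_resource_uri must be provided.");
  }
  const groups = input.identifiers ?? [];
  if (groups.length > 50) {
    errors.push("Maximum 50 identifier groups allowed.");
  }
  for (const g of groups) {
    if (g.values.length > 3000) {
      errors.push("Maximum 3000 values per identifier group.");
    }
    // Address and social identifiers cannot be hashed.
    const plaintextOnly =
      g.id_type === "address" || g.id_type.indexOf("social:") === 0;
    if (plaintextOnly && g.hash_type !== "plaintext") {
      errors.push(`${g.id_type} identifiers require hash_type 'plaintext'`);
    }
  }
  return errors;
}
```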

For files larger than 200,000 rows, paginate using the next_offset cursor returned in the response: pass it back as offset on the next call until the field is omitted (last page). offset, limit, and next_offset only apply on the CSV path; they are ignored when inline identifiers is used.

Address Matching Behavior

Addresses are parsed using libpostal into normalized components (street, city, state, zip, unit)
Matching is performed at both the street level (address_plaintext) and unit level (address_unit_plaintext)
When an input address has a unit AND the unit lookup matches at least one entity (i.e., the building has unit-precision data), entities matched for that input via the street criterion only — without a unit match for the same input — are dropped. This prevents the street-level fallback from returning the whole building when a more specific unit match is available.
For unit-bearing inputs whose unit lookup returns nothing (no unit-precision data exists for the building), the street fallback is preserved on every matched entity, with a 0.6x penalty applied to its quality score as a signal that unit precision could not be established.
Returns only the best-scoring entity or entities per input address
Household members tied at the maximum score are all returned
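
The rules above can be read as the following selection procedure for a single input address. This is an illustrative sketch, not the service's implementation: the type names and `selectAddressMatches` are hypothetical, and it assumes an entity's score is the best of its surviving criteria, which this page does not specify.

```typescript
interface AddressCandidate {
  entity_id: number;
  level: "street" | "unit"; // address_plaintext vs. address_unit_plaintext
  quality_score: number;
}

interface AddressResult {
  entity_id: number;
  quality_score: number;
}

const STREET_FALLBACK_PENALTY = 0.6;

function selectAddressMatches(
  inputHasUnit: boolean,
  candidates: AddressCandidate[],
): AddressResult[] {
  const unitEntityIds = candidates
    .filter((c) => c.level === "unit")
    .map((c) => c.entity_id);

  let kept: AddressCandidate[];
  if (inputHasUnit && unitEntityIds.length > 0) {
    // Unit-precision data exists: drop entities matched by street only.
    kept = candidates.filter((c) => unitEntityIds.indexOf(c.entity_id) !== -1);
  } else if (inputHasUnit) {
    // No unit data for the building: keep the street fallback, penalized.
    kept = candidates.map((c) => ({
      ...c,
      quality_score: c.quality_score * STREET_FALLBACK_PENALTY,
    }));
  } else {
    kept = candidates;
  }

  // Assumption: an entity's score is the best of its surviving criteria.
  const scoreByEntity: Record<string, number> = {};
  for (const c of kept) {
    const key = String(c.entity_id);
    const prev = scoreByEntity[key];
    if (prev === undefined || c.quality_score > prev) {
      scoreByEntity[key] = c.quality_score;
    }
  }

  // Keep only the entities tied at the maximum score.
  const ids = Object.keys(scoreByEntity);
  let best = -Infinity;
  for (const id of ids) best = Math.max(best, scoreByEntity[id] ?? -Infinity);
  return ids
    .filter((id) => scoreByEntity[id] === best)
    .map((id) => ({ entity_id: Number(id), quality_score: best }));
}
```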

CSV-mode parse-null warning: when csv_resource_uri is used with lookup_columns.address and most address values fail libpostal parsing (over half of the unique address inputs), the response includes a warnings[] entry with code address_parse_low_yield. This usually means the columns were listed in the wrong order, or that a single mapped column contains only fragments without a postcode.

List multi-column address mappings street-first — address1, address2?, city?, region?, postcode, country? — so libpostal sees a canonical address string.
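
To see what the parser receives for a given mapping, the join can be reproduced locally. This sketch follows the documented ", " separator and listed order; `joinAddressColumns` is a hypothetical helper, and skipping empty cells is an assumption rather than documented behavior.

```typescript
// Join the mapped address columns of one CSV row in the order given by
// lookup_columns.address.names.
function joinAddressColumns(
  row: Record<string, string>,
  names: string[],
): string {
  return names
    .map((n) => (row[n] ?? "").trim())
    .filter((cell) => cell.length > 0) // assumption: empty cells are skipped
    .join(", ");
}
```

For a row with address1 "123 Main St", an empty address2, city "San Francisco", region "CA", and postcode "94105", the street-first mapping produces "123 Main St, San Francisco, CA, 94105".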

The warning only fires on severe failures (>50% null parse rate). Silently depressed match rates from ordering mistakes that still parse won't trip it, so verify column order whenever the match rate is below expectation.
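
Because the warning alone does not catch every ordering mistake, a caller may want to combine it with a match-rate check. The sketch below assumes a caller-chosen threshold; `addressOrderSuspect` and `expectedMinRate` are hypothetical names, not part of the API.

```typescript
interface ResolveWarning {
  code: string;
  message: string;
}

// Subset of the response needed for the check.
interface ResolveSummary {
  warnings?: ResolveWarning[];
  stats: { requested: number; resolved: number; rate: number };
}

// True when the column order deserves a second look: either the
// low-yield warning fired or the match rate is below expectation.
function addressOrderSuspect(
  res: ResolveSummary,
  expectedMinRate: number,
): boolean {
  const lowYield = (res.warnings ?? []).some(
    (w) => w.code === "address_parse_low_yield",
  );
  return lowYield || res.stats.rate < expectedMinRate;
}
```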

Usage Examples

Example 1: Simple email resolution

{
  "entity_type": "person",
  "identifiers": [
    {
      "id_type": "email",
      "hash_type": "plaintext",
      "values": ["alice@example.com", "bob@example.com"]
    }
  ]
}

Example 2: Multi-criterion (email + phone)

{
  "entity_type": "person",
  "identifiers": [
    {
      "id_type": "email",
      "hash_type": "plaintext",
      "values": ["alice@example.com"]
    },
    {
      "id_type": "phone",
      "hash_type": "plaintext",
      "values": ["+15551234567"]
    }
  ]
}

Example 3: CSV resource input

{
  "entity_type": "person",
  "csv_resource_uri": "workflow://550e8400-e29b-41d4-a716-446655440000/uploads/customers.csv",
  "lookup_columns": {
    "email": { "names": ["email"] },
    "phone": { "names": ["phone"] }
  }
}

Example 3b: CSV with pre-hashed identifiers

{
  "entity_type": "person",
  "csv_resource_uri": "workflow://550e8400-e29b-41d4-a716-446655440000/uploads/customers.csv",
  "lookup_columns": {
    "email": { "names": ["email_md5"], "hash_type": "md5" },
    "phone": { "names": ["phone_sha256"], "hash_type": "sha256" }
  }
}

Example 4: Hashed identifiers with export

{
  "entity_type": "person",
  "identifiers": [
    {
      "id_type": "email",
      "hash_type": "md5",
      "values": ["5d41402abc4b2a76b9719d911017c592"]
    }
  ],
  "format": "csv"
}

Example 5: Request specific identifier types

{
  "entity_type": "person",
  "identifiers": [
    {
      "id_type": "email",
      "hash_type": "plaintext",
      "values": ["alice@example.com"]
    }
  ],
  "identifier_types": ["email", "phone", "name"]
}

Example 6a: Resolve a person by LinkedIn profile

Full profile URL and bare slug both normalize to the same lookup key:

{
  "entity_type": "person",
  "identifiers": [
    {
      "id_type": "social:linkedin",
      "hash_type": "plaintext",
      "values": ["https://www.linkedin.com/in/john-doe-070215/"]
    }
  ],
  "identifier_types": ["email", "phone", "social:linkedin"]
}

Example 6b: Resolve a business by LinkedIn company page

Bare slug and full URL both normalize to the same match, so the following two requests are equivalent:

{
  "entity_type": "business",
  "identifiers": [
    {
      "id_type": "social:linkedin",
      "hash_type": "plaintext",
      "values": ["tennis-en-padel-shop-noord"]
    }
  ],
  "identifier_types": ["name", "website", "social:linkedin"]
}

{
  "entity_type": "business",
  "identifiers": [
    {
      "id_type": "social:linkedin",
      "hash_type": "plaintext",
      "values": ["https://www.linkedin.com/company/tennis-en-padel-shop-noord/"]
    }
  ],
  "identifier_types": ["name", "website", "social:linkedin"]
}
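
The equivalence of the slug and URL forms follows from the stripping rules listed under Supported id_types. A rough client-side approximation is sketched below; `normalizeLinkedInSlug` is a hypothetical helper, and the service's actual normalization may differ in details such as casing or query-string handling.

```typescript
// Reduce a LinkedIn profile or company URL to its bare slug.
// A bare slug passes through unchanged.
function normalizeLinkedInSlug(input: string): string {
  let s = input.trim();
  s = s.replace(/^https?:\/\//i, ""); // scheme
  s = s.replace(/^www\./i, ""); // www. prefix
  // Drop the host and the /in/ or /company/ prefix when a URL was given.
  s = s.replace(/^linkedin\.com\/(in|company)\//i, "");
  // Keep only the first path segment; this removes trailing slashes and
  // suffixes such as /details/experience or /about.
  return s.split(/[\/?#]/)[0] ?? "";
}
```

Both values from Example 6b reduce to tennis-en-padel-shop-noord under this sketch.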

Example 7: Resolve a business by website domain

{
  "entity_type": "business",
  "identifiers": [
    {
      "id_type": "website",
      "hash_type": "plaintext",
      "values": ["example.com"]
    }
  ],
  "identifier_types": ["name", "website"]
}