ICEBERG
Sourcing Effectiveness Report
Adviser-grade output, at a fraction of the cost.
Benchmarking AI-generated buyer universes against adviser-grade hand-builds
Representative example: Sell-side trade-sale mandate, services sector
Benchmark adviser: Boutique sell-side firm, 4-day hand-built buyer universe
Benchmark output: 112 candidates, 80 tier-ranked, 10 actionable contacts

Iceberg in Brief

Iceberg is the execution intelligence layer for sell-side M&A and capital raise advisers. The platform compresses the four days a senior analyst spends building a buyer universe into a four-hour automated run, while preserving the qualitative bar a hand-built universe sets.

Iceberg builds a fresh buyer universe per mandate using deal-led discovery across an internal historical deals dataset, industry directories, regional registries, trade press, association lists, and company sites. The system surfaces the majority of investors and buyers PitchBook and CapIQ cover, plus a long tail those platforms miss entirely. Each mandate runs multiple buyer angles in parallel: strategic operators, services-rollup PE, corporate acquirers, family offices, and adjacent VC where relevant. Every contact carries a fit score, a sourced rationale, and a verified contact path. All generated per mandate, not pulled from a static database.

The implication for evaluators is structural. This is not a model API wrapped in an interface. It is a practitioner's playbook made operational, validated across 32 live mandates internally before it ever met an external user.

Under the hood
Defensibility, not just output
Every contact in the system can be traced to a non-AI origin: a sourcing-API row, a CSV cell, or a form field. The language model proposes, a deterministic ingest pipeline validates and writes. There is no path by which a hallucinated email or phone reaches an adviser. See Annexe C.

Executive Summary

This report benchmarks Iceberg's V2 contact sourcing pipeline against a hand-built buyer universe produced by a boutique sell-side adviser on a representative trade-sale mandate, drawn from internal benchmark work across multiple live mandates. The goal: quantify, on real exercises, how close an AI-driven sourcing run gets to adviser-grade output, and at what cost in time and money.

Benchmark: 4 days of senior adviser and analyst effort produced 112 candidate companies, 80 tier-ranked (16 Tier 1, balance Tier 2 and below), 10 actionable (tier-ranked plus verified email).

Test: blind run of Iceberg V2 on the same mandate, zero input from the adviser's existing work, 4 hours of unattended runtime.

Result: 93 contacts across two parallel buyer angles. 67 with verified work emails, 95% on-thesis. 11 of the adviser's named contacts surfaced independently, including 5 of his 16 Tier 1 picks. Plus 33 services-rollup PE firms with 47 partner-level contacts the adviser's list did not contain.

Floor, not ceiling. These numbers came from a single-shot V2 cold-start: no prior platform learning, no warm-path scoring, no master contact graph populated across workspaces, no adviser prompts or context, and no iterative refinement. Every mandate on the platform is designed to be re-run multiple times across its lifecycle, with progressively less input each pass. Thin feedback after each delivery (kept, rejected, asked-for-more) compounds into a sharper next run. This report measures the baseline before any of those compounding mechanisms kick in.

Headline
Iceberg V2 delivered ~75% of adviser-grade quality, 6.7x adviser-actionable contact volume, in 12.5% of the time. Roughly 50x throughput per analyst-hour at this benchmark.

For procurement and security reviewers: Annexes A through C and the "Why Not DIY" section cover data flows, sub-processors, hosting topology, and the deterministic ingest path that prevents the model from generating contact details.

Benchmark and Test

The adviser's hand-built universe (4 days)

Iceberg V2 blind run (4 hours)

Inputs: company name, sector, geography, revenue, EBITDA margin. Zero exposure to the adviser's tier list, rationales, or contact picks. No prior contacts in the workspace, no CIM, no seed list.

Method: two parallel sourcing paths (strategic operators, services-rollup PE), per-contact strategic reasoning attached to every name, 4 hours unattended.

Output: 93 contacts across 59 unique organisations (46 strategic operator contacts at 26 firms, 47 PE partner contacts at 33 firms). 67 contacts with verified work emails (72% overall, 78% on the PE side). 95% on-thesis for the trade-sale brief, no VC or UHNW noise.

V2 cold-start, compounding mechanisms not active

Re-running the same mandate after 20 adviser mandates have moved through the platform, with 15 minutes of upfront context, or with one round of feedback on this first pass, would meaningfully exceed these numbers.

Comparison Matrix

Metric | Adviser hand-build | Old Iceberg (rejected) | Iceberg V2
Total contacts | 112 (80 tiered) | 100 | 93
With verified work email | 30 (27%) | 99 (99%, wrong people) | 67 (72%)
Actionable contacts | 10 | 0 | 67
On-thesis fit | High | ~10-20% (VC/UHNW noise) | ~95% (clean)
Per-contact rationale | Yes (paragraph) | No (generic tag) | Yes (scored, sourced)
PE-side coverage | 13 PE-backed targets, 0 PE contacts | None | 33 PE firms, 47 partners
Person-level rediscovery (blind) | n/a | 0 | 11 of adviser's named contacts
Wall time | 4 days | ~2 hours | 4 hours
Analyst hours consumed | ~32 hours | n/a | 0 hours

Where Iceberg V2 Hits the Adviser Bar

Person-level rediscovery, not just company-level

11 of the adviser's hand-picked named contacts surfaced in the blind run, the exact individuals he had identified as the right decision-makers after his own 4 days of research. Of these:

This is a contact-selection signal, not just a sourcing signal. The pipeline's reasoning about who the right person is at each strategic buyer converged with a senior adviser's manual pick 11 times, blind.

PE-side coverage the adviser did not build

The adviser's list flagged 13 PE-backed targets but contained zero PE-side contacts. Iceberg V2 produced 33 services-rollup PE firms with 47 partner-level contacts, 78% with verified emails. The PE backer the adviser had flagged inline as the existing investor in his top Tier 1 target surfaced independently. Two further PE firms he had referenced as prior backers of his tier-ranked names also appeared organically. The remaining firms are genuine universe expansion in the buyer category most likely to actually close a trade sale at premium multiples.

Buyer-type segmentation aligned to the brief

Where Iceberg's previous output collapsed contacts into generic 'Strategic / UHNW' tags and surfaced VC and angel investors the adviser explicitly rejected, V2 segments cleanly into strategic operators and services-rollup PE. Each contact carries a scored strategic reason and a portfolio_companies field for context. This is the structural fix to the core complaint that ended the original pilot.

Where Iceberg V2 Trails, and How the Gap Closes

Organic Tier 1 hit rate at 31%

5 of the adviser's 16 Tier 1 picks surfaced at the contact level in the blind run. Several missed targets appear in PE firm portfolios elsewhere in the same output but did not surface independently in the operator-side path.

How this gap closes: three compounding mechanisms tighten Tier 1 hit rate from the V2 baseline. First, the intended workflow has the adviser upload the CIM, key contacts, and any existing target list at mandate setup; the blind run benchmarks an unseeded cold start, not the production workflow. Second, the master contact graph accumulates across every mandate run on the platform, so each new adviser tightens the coverage map for every other adviser working in adjacent verticals. Third, every mandate is designed to be re-run with progressively lighter input. Thin feedback after each delivery (kept, rejected, asked-for-more) compounds into a sharper next pass, with the same mandate typically run three to five times across its lifecycle. With CIM seeding, modest platform tenure, and one round of iterative refinement, Tier 1 hit rate is projected to close toward 90%+ in production.

Email coverage on operator-side founders at 61%

Acceptable on a cold run. Founder and CEO emails at privately-held offshore-services businesses are the hardest pattern to enrich. Coverage on the services-rollup PE side ran at 78%.

Throughput and Cost Math

The core economic case for Iceberg-assisted sourcing on a single mandate:

Dimension | Adviser hand-build | Iceberg V2 | Ratio
Wall time | 4 days | 4 hours | 12.5%
Senior analyst hours | ~32 hours | 0 hours | n/a
Actionable contacts produced | 10 | 67 | 6.7x
Cost per mandate | AUD 4,800 to 8,000 | $500 per 150 contacts | ~5% of adviser cost
Effective analyst throughput | 0.31 contacts/hour | Unattended | ~50x at this benchmark

Iceberg replaces the 4-day analyst grind with a 4-hour unattended run that delivers 6.7x the actionable volume at roughly 5% of the adviser cost base, while preserving ~75% of the qualitative bar a hand-built universe sets. Closing the remaining 25% takes a 20-minute upload of the adviser's CIM and seed contacts, the intended production workflow.
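The ratios quoted above can be reproduced from the report's own figures. The sketch below is a worked check only; it assumes an 8-hour working day (which is how 4 days maps to the ~32 analyst hours quoted), and the variable names are illustrative.

```typescript
// Figures come from the benchmark; the 8-hour working day is an assumption.
const HOURS_PER_DAY = 8;

const adviser = { wallHours: 4 * HOURS_PER_DAY, actionable: 10 };
const icebergV2 = { wallHours: 4, actionable: 67 };

// Actionable-contact volume: 67 / 10
const volumeRatio = icebergV2.actionable / adviser.actionable;

// Share of wall time: 4 / 32
const timeShare = icebergV2.wallHours / adviser.wallHours;

// Actionable contacts per elapsed hour on each side
const adviserPerHour = adviser.actionable / adviser.wallHours;
const icebergPerHour = icebergV2.actionable / icebergV2.wallHours;
const throughputRatio = icebergPerHour / adviserPerHour;

console.log(volumeRatio.toFixed(1));       // 6.7
console.log((timeShare * 100).toFixed(1)); // 12.5
console.log(adviserPerHour.toFixed(2));    // 0.31
console.log(throughputRatio.toFixed(1));   // 53.6
```

On these inputs the throughput ratio works out to about 53.6x, which the report rounds to roughly 50x.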

Why Not DIY: Claude Plus a Database

The natural question for any well-resourced firm is whether an analyst with a PitchBook seat and an LLM subscription could produce the same output. The answer is no, and the reason is not the LLM. It is the infrastructure underneath: deal-led discovery, multi-provider verification, persistent memory, and the audit trail required to put an adviser's name behind the contact.

Dimension | DIY (Claude + PitchBook or Grata) | Iceberg
Universe construction | Static database filters, analyst prompts the LLM over CSV exports | Deal-led discovery across registries, trade press, association lists, and historical-deal datasets, run as parallel buyer-angle searches per mandate
Contact verification | Manual, varies by analyst and chosen enrichment tools | Multi-provider verification waterfall, emails and phones are never generated by the model
Memory across mandates | None, every deal starts cold | Master contact graph keyed by email and LinkedIn URL, cross-mandate signals carry forward
Analyst time per mandate | 30 to 40 hours for prompting, deduping, enriching, and CRM formatting | Zero hours to run, 4-hour unattended pipeline straight to CRM
Per-contact audit trail | None | Source, rationale, and score recorded against every contact
Governance posture | Ad hoc, varies by analyst and project | Workspace-scoped tenancy, Sydney-region inference, contractual no-train terms with the model provider

DIY assembles each of these as separate workflows, glued together by an analyst. Iceberg ships them as one system, with verification, audit, and governance built in by default. The cost difference is not in the LLM bill. It is in the analyst hours saved, the verification rework avoided, and the procurement comfort that comes from contact data with full provenance.

The Sourcing Engine

Every component below exists to surface buyers and investors an adviser could not realistically find in a 4-day analyst hunt. The system is built around deal-led discovery, multi-angle orchestration, multi-provider verification, and explicit score-and-rationale filtering before contacts reach the adviser.

Multiple angles per mandate, run in parallel

Each mandate is decomposed into the buyer categories most likely to close it at premium multiples. A trade sale might run strategic-operator and services-rollup-PE discovery in parallel; a capital raise might run sector-thesis VC, family office, and growth-PE paths simultaneously. The architecture also supports multiple mandates running in parallel on the platform at any given time. A firm with twenty live mandates can have all twenty universes building simultaneously, each tuned to its own deal, with no contention across runs.

Under the hood
Concurrency by design
Long-running sourcing work is dispatched to dedicated async queues. Twenty parallel mandates run as twenty independent jobs against the same workspace-scoped API. No code change required to scale horizontally. See Annexe A.

Data sources and refresh cadence

For each mandate, the system searches for organisations that have done similar deals in the recent past, drawing on an internal historical deals dataset plus searches across industry directories, regional registries, trade-press coverage, association lists, and company websites. For investor mandates, the search walks from those companies back to the investors who wrote the cheques.

Coverage vs PitchBook or CapIQ: the majority of named deals and investors those platforms cover, plus a long tail they miss entirely. The owner of a forklift rental business does not sit in PitchBook; a deal-led search finds them.

Refresh cadence is per-mandate, not scheduled. Each new brief triggers a fresh search rather than pulling from a static database. Contacts older than 90 days are re-enriched against live providers before delivery. Every email, phone, LinkedIn URL, and employment entry comes from a contact-data provider with a deliverability status attached. There is no facility for the AI to hallucinate a contact detail.
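The 90-day re-enrichment rule can be expressed as a simple age check. This is a minimal sketch, not the platform's code: the `EnrichedContact` shape, the `lastEnrichedAt` field, and the function name are all hypothetical; only the 90-day threshold comes from the text.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const MAX_AGE_DAYS = 90; // threshold stated in the text

// Hypothetical record shape: only the timestamp matters for this check.
interface EnrichedContact {
  email: string;
  lastEnrichedAt: Date;
}

// True when the enrichment is older than 90 days as of `asOf`,
// meaning the contact should go back to a live provider before delivery.
function needsReEnrichment(contact: EnrichedContact, asOf: Date): boolean {
  const ageDays = (asOf.getTime() - contact.lastEnrichedAt.getTime()) / MS_PER_DAY;
  return ageDays > MAX_AGE_DAYS;
}

const asOf = new Date("2025-06-30T00:00:00Z");
const freshContact: EnrichedContact = {
  email: "a@example.com",
  lastEnrichedAt: new Date("2025-05-01T00:00:00Z"), // 60 days old
};
const staleContact: EnrichedContact = {
  email: "b@example.com",
  lastEnrichedAt: new Date("2025-01-15T00:00:00Z"), // 166 days old
};

console.log(needsReEnrichment(freshContact, asOf)); // false
console.log(needsReEnrichment(staleContact, asOf)); // true
```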

Under the hood
Why a static database export plus a chat window does not match this
A PitchBook seat and an LLM subscription give an analyst an index and a prompt box. Deal-led discovery, the parallel multi-angle orchestration, the verification waterfall, and the workspace-scoped audit trail are separate pieces of infrastructure that have to exist before the output is procurement-grade. See "Why Not DIY" above for the full comparison.

Methodology, dedup, and confidence scoring

Each mandate runs through five stages: organisation discovery anchored on adjacent-deal companies, then people discovery by title/role, then multi-provider enrichment (work email, phone, LinkedIn, employment history), then model scoring (0-100 against the mandate rubric, with rationale), then filter and deliver. Contacts below a hard score floor of 50 are not surfaced. Above the floor, all contacts up to the requested cap are delivered with score and rationale visible.
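The final filter-and-deliver stage can be sketched as follows. This is an illustration under stated assumptions: the `ScoredContact` shape and function name are hypothetical, and ordering by descending score is an assumption (the text says only that contacts above the floor are delivered up to the cap). The floor of 50 and the cap limits (default 30, max 150) are the figures given in this report.

```typescript
interface ScoredContact {
  name: string;
  score: number; // 0-100 against the mandate rubric
  rationale: string;
}

const SCORE_FLOOR = 50; // hard floor: anything below is never surfaced
const DEFAULT_CAP = 30; // default delivery cap
const MAX_CAP = 150;    // maximum delivery cap

function filterAndDeliver(
  scored: ScoredContact[],
  requestedCap: number = DEFAULT_CAP,
): ScoredContact[] {
  // Clamp the requested cap into the supported range.
  const cap = Math.min(Math.max(requestedCap, 0), MAX_CAP);
  return scored
    .filter((c) => c.score >= SCORE_FLOOR) // filter() copies, so the input is not mutated
    .sort((a, b) => b.score - a.score)     // assumption: highest score first
    .slice(0, cap);
}

const sampleScored: ScoredContact[] = [
  { name: "Operator A", score: 92, rationale: "Adjacent acquisition in the last 24 months" },
  { name: "Operator B", score: 49, rationale: "Weak sector overlap" },
  { name: "PE Partner C", score: 50, rationale: "Services-rollup thesis" },
  { name: "PE Partner D", score: 71, rationale: "Portfolio company in the same vertical" },
];

console.log(filterAndDeliver(sampleScored).map((c) => c.name));
// [ 'Operator A', 'PE Partner D', 'PE Partner C' ]
```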

Adviser-side dedup is handled by passing existing contacts to the API as a suppression list, so prior outreach and known names are filtered out before delivery. The model scores over structured records that have already come back from a verification step. It never produces a contact detail itself.
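A suppression list of this kind might be applied as below. This is a sketch only: matching on normalised email and LinkedIn URL is an assumption, as are the `Candidate` shape and the helper names.

```typescript
interface Candidate {
  name: string;
  email?: string;
  linkedinUrl?: string;
}

// Normalise identifiers so case and trailing slashes do not defeat matching.
const normEmail = (e: string): string => e.trim().toLowerCase();
const normUrl = (u: string): string => u.trim().toLowerCase().replace(/\/+$/, "");

// Drop any candidate whose email or LinkedIn URL appears on the suppression list.
function applySuppression(candidates: Candidate[], suppressed: Candidate[]): Candidate[] {
  const emails = new Set<string>();
  const urls = new Set<string>();
  for (const s of suppressed) {
    if (s.email !== undefined) emails.add(normEmail(s.email));
    if (s.linkedinUrl !== undefined) urls.add(normUrl(s.linkedinUrl));
  }
  return candidates.filter((c) => {
    const emailHit = c.email !== undefined && emails.has(normEmail(c.email));
    const urlHit = c.linkedinUrl !== undefined && urls.has(normUrl(c.linkedinUrl));
    return !(emailHit || urlHit);
  });
}

const priorOutreach: Candidate[] = [
  { name: "Jane Doe", email: "jane@example.com" },
  { name: "Sam Lee", linkedinUrl: "https://www.linkedin.com/in/sam-lee" },
];
const sourcedCandidates: Candidate[] = [
  { name: "Jane Doe", email: "Jane@Example.com" },
  { name: "Sam Lee", linkedinUrl: "https://www.linkedin.com/in/sam-lee/" },
  { name: "New Buyer", email: "new.buyer@example.com" },
];

console.log(applySuppression(sourcedCandidates, priorOutreach).map((c) => c.name));
// [ 'New Buyer' ]
```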

Under the hood
The LLM proposes, the schema disposes
Every contact write goes through one of three deterministic paths (sourcing-API ingest, CSV import, manual entry), all sharing the same validators and dedup helpers. Emails are lowercased and regex-checked; phones are normalised to digits-only for dedup; malformed values are dropped. The agent tool catalogue has no function whose job is to generate an email or phone, and a schema gates every tool call. Same input, same output, every time. Full contract in Annexe C.
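As an illustration of the normalisation rules just described, the sketch below lowercases and pattern-checks an email and reduces a phone number to a digits-only dedup key. The function names and the regular expression are assumptions; the platform's actual helpers and pattern are not given in this report.

```typescript
// Simple shape check only; an assumption, not the platform's actual pattern.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Lowercase and regex-check; malformed values are dropped (returned as null).
function normaliseEmail(raw: string): string | null {
  const value = raw.trim().toLowerCase();
  return EMAIL_PATTERN.test(value) ? value : null;
}

// Digits-only key used for dedup, not for display or dialling.
function phoneDedupKey(raw: string): string | null {
  const digits = raw.replace(/\D/g, "");
  return digits.length > 0 ? digits : null;
}

console.log(normaliseEmail("  Jane.Doe@Example.COM ")); // jane.doe@example.com
console.log(normaliseEmail("not-an-email"));            // null
console.log(phoneDedupKey("+61 (2) 9000-0000"));        // 61290000000
console.log(phoneDedupKey("n/a"));                      // null
```

Because both functions are pure, the same input always yields the same output, which is the property the callout relies on.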

Data Sovereignty

Procurement reviewers ask, correctly, exactly what data leaves the adviser's control and which third parties touch it. The complete answer is below; further detail in Annexes A and B.

What the adviser submits to the sourcing API

Required: buyer type, geography list, natural-language brief on the ideal investor or buyer (20+ chars), and a mandate-context object describing the company being transacted. Optional: structured filters (sector, stage, transaction type), title filters, organisation criteria, suppression lists for known organisations or contacts, seed contacts, advisor-context free text, and a delivery cap (default 30, max 150).
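The request contract above might be typed and checked roughly as follows. Field names here are hypothetical (the actual API schema is not reproduced in this report) and only a subset of the optional fields is shown; the constraints themselves (brief of 20+ characters, delivery cap defaulting to 30 with a maximum of 150) come from the text.

```typescript
// Hypothetical field names; only the constraints are taken from the text.
interface SourcingRequest {
  buyerType: string;
  geographies: string[];
  brief: string;                           // natural-language brief, 20+ chars
  mandateContext: Record<string, unknown>; // describes the company being transacted
  deliveryCap?: number;                    // default 30, max 150
  suppressedContacts?: string[];
}

function validateRequest(req: SourcingRequest): { errors: string[]; deliveryCap: number } {
  const errors: string[] = [];
  if (req.buyerType.trim() === "") errors.push("buyerType is required");
  if (req.geographies.length === 0) errors.push("at least one geography is required");
  if (req.brief.trim().length < 20) errors.push("brief must be at least 20 characters");
  const deliveryCap = req.deliveryCap ?? 30;
  if (!Number.isInteger(deliveryCap) || deliveryCap < 1 || deliveryCap > 150) {
    errors.push("deliveryCap must be an integer between 1 and 150");
  }
  return { errors, deliveryCap };
}

// Example using an anonymised company name in the mandate context.
const exampleRequest: SourcingRequest = {
  buyerType: "services-rollup PE",
  geographies: ["Australia", "New Zealand"],
  brief: "PE firms running services roll-ups with prior offshore-services deals",
  mandateContext: { companyName: "Project Alpha", sector: "services" },
};

console.log(validateRequest(exampleRequest));
```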

Anonymisation option

Iceberg can anonymise the payload, passing a generic company name and a sanitised company description, and still run a successful deep search. Deal characteristics (sector, geography, transaction type, target profile) are what drive sourcing.

What touches the model provider

The mandate brief and the per-contact structured payload are sent to an enterprise LLM provider for classification, query refinement, and contact scoring. The model operates under enterprise data-processing terms: customer inputs are contractually excluded from model training, inference is region-pinned to Sydney, and data is protected by TLS in transit and AES-256 at rest. The model sees the structured payload only, never the verified email address or phone number.

Hosting

Main platform (adviser-facing CRM, contact store, AI agent layer, all customer data) sits in AWS ap-southeast-2 (Sydney). Inference runs through an enterprise LLM in the same Sydney region. Some legacy components of the sourcing pipeline currently sit on overseas infrastructure; full migration to onshore residency is in progress and scheduled ahead of any procurement-driven rollout. Detail in Annexe B.

Annexes
Technical detail for engineering, security, and procurement reviewers.

Annexe A · Platform Architecture

Audience: engineering lead, infosec architect, or CTO doing diligence. Goal: enough detail to trust the stack and run a trial. Deliberately silent on the matching, ranking, and agent-orchestration logic that makes the product differentiated; available under NDA.

A.1 System context and trust boundaries

A.2 Stack at a glance

Layer | Tech
Frontend | Next.js 14 (App Router), NextAuth, React Query, Tailwind
API | NestJS 10, TypeScript, class-validator. 26 feature modules. JWT global guard, role and workspace decorators, OpenAPI exposed at /api.
Persistence | PostgreSQL 16 via TypeORM. Single primary, 28 entities, ~120 versioned migrations. synchronize: false; schema only changes via migrations.
Cache / queues | Redis 7 + BullMQ async workers (email sync, sequence sends, follow-ups, embeddings, sourcing polling).
Object storage | AWS S3. Access brokered by the API.
AI orchestration | Agent runtime hosted inside the API process. Tools call the same workspace-scoped repositories the REST API uses; no shadow data path. Traces export to a managed observability platform.
Sourcing client | External deep-search API client. Results land in dedicated tables and are promoted into workspace contacts under adviser review.
Email | Resend (transactional), Gmail / Outlook OAuth (user-scoped). Outbound and inbound both go through the API.

A.3 AWS topology (ap-southeast-2 Sydney)

A.4 Authentication and tenancy

A.5 Where AI sits in the stack

The AI orchestration layer is hosted inside the NestJS process. One process to log, throttle, redact, and observe. AI tools call the same workspace-scoped repositories the REST API uses, so there is no shadow data path. Conversation memory persists in a dedicated schema in the same DB with the same backup, encryption, and region. Model traces (model, tool calls, latency, token counts) export to managed observability.

Model calls go through an enterprise LLM endpoint pinned to australia-southeast1 (Sydney). The choice of enterprise endpoint (rather than the consumer API) is specifically for the data-handling contract: prompt and response data are not used to train foundation models, inference is region-pinned, traffic is encrypted with TLS in transit, and an enterprise DPA covers data processing. The request leaves the AWS VPC for a managed inference endpoint but never leaves the Sydney metro region.

Agent tools split into retrieval tools (read from our DB, the sourcing API, or live web search; returns include citations) and mutation tools (write through the same NestJS services as the REST API, so all the same validation, normalisation, and dedup apply).

A.6 Deploy and release

A.7 Out of scope for this document

The contact-to-buyer matching model, the agent orchestration logic, and the scoring weights are not described here. Available under NDA.

Annexe B · Security Posture

Audience: procurement or InfoSec reviewer working through a checklist. Each section names the control and the implementation today; code references and IaC walkthroughs available on request.

B.1 At a glance

Topic | Status
Data sovereignty | Customer data tier in AWS ap-southeast-2 (Sydney). LLM inference pinned to Sydney. Some legacy sourcing components currently offshore; full migration to onshore in progress.
Hosting | AWS (ECS Fargate + RDS + ElastiCache + S3). IaC via Pulumi.
Authentication | NestJS JWT global guard, bcrypt password hashing, NextAuth on the frontend.
Multi-tenancy | Workspace-scoped data model enforced in code on every query.
Authorisation | Role decorators on every controller.
TLS | TLS 1.2+ at the ALB, ACM certificates, HTTP-to-HTTPS 301.
Encryption at rest | RDS, ElastiCache, S3 all encrypted with AWS-managed KMS keys.
Network isolation | App + data tier in private subnets. Ingress only via ALB.
Secrets | SSM Parameter Store SecureString (KMS). No secrets in source control.
Logging | ECS to CloudWatch; exception capture to PostHog.
Backups | Automated RDS snapshots. Redis snapshotting on production.
Source control | Private GitHub repo. CI deploys via GitHub Actions.

B.2 Authentication and authorisation

Result: no controller can be reached by an anonymous user unless it is explicitly marked @Public(), and no workspace data can be reached by a user who is not a member of that workspace.
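The default-deny rule stated above can be modelled as a small decision function. This is a framework-free sketch for illustration, not the NestJS guard itself; the `RouteMeta` and `AuthUser` shapes and the decision labels are hypothetical.

```typescript
interface RouteMeta {
  isPublic: boolean;    // stands in for a route marked @Public()
  workspaceId?: string; // workspace the request targets, if any
}

interface AuthUser {
  id: string;
  workspaceIds: string[];
}

type AccessDecision = "allow" | "deny-unauthenticated" | "deny-not-member";

function decideAccess(route: RouteMeta, user: AuthUser | null): AccessDecision {
  if (route.isPublic) return "allow";               // explicit opt-out of authentication
  if (user === null) return "deny-unauthenticated"; // anonymous callers are denied by default
  if (route.workspaceId !== undefined && !user.workspaceIds.includes(route.workspaceId)) {
    return "deny-not-member";                       // authenticated, but outside the workspace
  }
  return "allow";
}

const member: AuthUser = { id: "u1", workspaceIds: ["ws-a"] };

console.log(decideAccess({ isPublic: true }, null));                         // allow
console.log(decideAccess({ isPublic: false, workspaceId: "ws-a" }, null));   // deny-unauthenticated
console.log(decideAccess({ isPublic: false, workspaceId: "ws-b" }, member)); // deny-not-member
console.log(decideAccess({ isPublic: false, workspaceId: "ws-a" }, member)); // allow
```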

B.3 Transport, encryption, network

B.4 Secrets and supply chain

B.5 Logging, monitoring, audit

B.6 Backups and disaster recovery

B.7 Sub-processors

These third parties receive request payloads or store user data on Iceberg's behalf. All are reachable from the API only, behind authenticated tools. No client-side calls.

Vendor | Data flow | Purpose
AWS (ap-southeast-2) | All customer data | Primary hosting (compute, DB, cache, storage, secrets)
Cloudflare | DNS + edge TLS | DNS and CDN
Enterprise LLM (Sydney region) | Prompt text including excerpts of adviser-supplied content; responses returned to API | Inference. Enterprise terms: prompt and response data not used to train foundation models. Region-pinned. TLS in transit, AES-256 at rest. Enterprise DPA covers data processing.
Sourcing provider (deep-search API) | Mandate parameters; receives back investor records | Investor sourcing pipeline
Web search provider | Search queries derived from adviser intent | Real-time web search citations
Resend | Outbound email content + recipient address | Transactional + system email delivery
Gmail / Outlook (per-user OAuth) | User's own inbox under user-granted scopes | Optional inbox sync for the adviser
Observability platform | LLM request/response traces, token counts | Observability of AI calls
PostHog | Product analytics + exception payloads | Analytics + error monitoring

Formal sub-processor disclosure with vendor security pages and DPAs available on request.

B.8 OWASP coverage

B.9 Access to production and incident response

B.10 Known gaps and roadmap

Surfaced transparently because procurement will ask:

Annexe C · Deterministic Contact Ingest

The question that comes up at every meeting: "How do you stop the model making up emails and phone numbers?"

Short answer: the model is not asked to. Contact data flows through a deterministic ingest pipeline (external sourcing API, CSV import, or user-entered fields) and lands in the database via the same validation and dedup helpers in every case. The LLM can decide which contact to act on, but it cannot author an email address or phone number into the system.

C.1 The three (and only three) paths

Path | Source of truth | Determinism mechanism
Sourcing run | External deep-search API (mandate-driven) | API returns structured rows. Upsert into GlobalContact, then promoted into the workspace as Contact.
CSV / spreadsheet import | The adviser's file | Parsed row by row. Values run through normalisers before insert.
Manual entry | Adviser types into the CRM form | Same DTOs, same validators, same normalisers as import.

There is no fourth path. The AI does not invent contact records. The agents call the same workspace-scoped services the REST API uses, with the same schemas and the same guards.

C.2 The validation boundary

Every write of an email or phone goes through the same two helpers:

In parallel, ~25 enum-like normalisers (geography, investor type, check size, priority, etc.) are contract-bound to return null on no-match rather than throw. A bad row never poisons a batch import; the caller decides whether to default or reject.
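The null-on-no-match contract can be illustrated with a single normaliser. This is a sketch under assumptions: the `InvestorType` vocabulary and the alias table are invented for the example, since the real enum values are not listed in this report.

```typescript
type InvestorType = "PE" | "VC" | "FAMILY_OFFICE" | "STRATEGIC";

// Illustrative alias table; the real vocabularies are not described in the text.
const INVESTOR_TYPE_ALIASES = new Map<string, InvestorType>([
  ["pe", "PE"],
  ["private equity", "PE"],
  ["vc", "VC"],
  ["venture capital", "VC"],
  ["family office", "FAMILY_OFFICE"],
  ["strategic", "STRATEGIC"],
]);

// Contract: never throws; returns null when nothing matches.
function normaliseInvestorType(raw: unknown): InvestorType | null {
  if (typeof raw !== "string") return null;
  return INVESTOR_TYPE_ALIASES.get(raw.trim().toLowerCase()) ?? null;
}

// One unrecognised value does not abort the batch; the caller decides what to do with nulls.
const rawValues: unknown[] = ["Private Equity", "family office", "Hedge fund??", 42];
const importedRows = rawValues.map((raw) => ({ raw, investorType: normaliseInvestorType(raw) }));
const unmatchedRows = importedRows.filter((r) => r.investorType === null);

console.log(importedRows.length, unmatchedRows.length); // 4 2
```

A `Map` is used rather than a plain object so that inherited property names such as "constructor" cannot produce a false match.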

C.3 Provenance

Each Contact records how it got there: source (USER, SOURCED, or GLOBAL), sourceGlobalContactId back-reference, lastGlobalSyncAt timestamp, and per-deal aiRanking + aiReasoning. Signals (news mentions, events) carry their source URL. The canonical GlobalContact stores the raw upstream payload so we can re-parse later without going back to the provider. A reviewer can answer 'where did this email come from?' by inspecting the chain. No audit dead-ends.
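A reviewer's "where did this email come from?" check could be modelled like this. The `source`, `sourceGlobalContactId`, and `lastGlobalSyncAt` names come from the text; the remaining fields, the in-memory `Map`, the wording of the returned strings, and the reading of USER as adviser-supplied are assumptions for illustration.

```typescript
type ContactSource = "USER" | "SOURCED" | "GLOBAL";

interface GlobalContactRecord {
  id: string;
  rawPayload: unknown; // raw upstream payload kept for later re-parsing
}

interface ContactRecord {
  id: string;
  email: string;
  source: ContactSource;
  sourceGlobalContactId: string | null;
  lastGlobalSyncAt: Date | null;
}

// Walk the provenance chain one hop and describe what was found.
function describeOrigin(contact: ContactRecord, globals: Map<string, GlobalContactRecord>): string {
  if (contact.source === "USER") return "adviser-supplied";
  const origin =
    contact.sourceGlobalContactId === null ? undefined : globals.get(contact.sourceGlobalContactId);
  if (origin === undefined) return "provenance missing: flag for review";
  const synced = contact.lastGlobalSyncAt?.toISOString() ?? "never";
  return `global contact ${origin.id}, last synced ${synced}`;
}

const globalStore = new Map<string, GlobalContactRecord>([
  ["gc_1", { id: "gc_1", rawPayload: { provider: "example-provider" } }],
]);

const sourcedContact: ContactRecord = {
  id: "c_1",
  email: "partner@example.com",
  source: "SOURCED",
  sourceGlobalContactId: "gc_1",
  lastGlobalSyncAt: new Date("2025-01-01T00:00:00Z"),
};

console.log(describeOrigin(sourcedContact, globalStore));
// global contact gc_1, last synced 2025-01-01T00:00:00.000Z
```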

C.4 What the LLM can and cannot do

Agents are wired to a fixed catalogue of tools. Each tool has a schema-validated input. The model can choose a tool and propose arguments; the schema validates before the tool body runs, and the tool body calls back into the same NestJS service the REST API uses.

Tool family | Can the model invent contact PII?
Retrieval (research, sourcing, web search) | No. Pulls from external providers or our DB. Returns include citations and source URLs. Extraction temperature 0.2 (near-deterministic).
Mutation (create, update) | The model can propose values, but they flow through the same DTO + validator + dual-write path as a manual form submission. Malformed values dropped, duplicates collapsed.
PII generation | No tool with this purpose exists in the catalogue.

Agent prompts also enforce the contract: research first, present, confirm, then write. A row without confirmed provenance does not get written.
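The "schema validates before the tool body runs" pattern can be sketched as a small dispatcher. Everything named here is hypothetical (the tool name, the hand-written validator, the catalogue); the real catalogue and schema library are not described in this report.

```typescript
// A validator returns typed arguments, or null when the schema rejects the input.
interface Tool<T> {
  validate: (input: unknown) => T | null;
  run: (args: T) => string;
}

// Hypothetical retrieval tool. Note the catalogue has no tool that generates PII.
const searchContacts: Tool<{ query: string }> = {
  validate: (input) => {
    if (typeof input !== "object" || input === null) return null;
    const query = (input as { query?: unknown }).query;
    return typeof query === "string" && query.length > 0 ? { query } : null;
  },
  run: (args) => `searched for "${args.query}"`,
};

const toolCatalogue = new Map<string, Tool<any>>([["searchContacts", searchContacts]]);

// The model proposes a tool name and arguments; nothing runs until both pass.
function callTool(name: string, proposedArgs: unknown): string {
  const tool = toolCatalogue.get(name);
  if (tool === undefined) throw new Error(`unknown tool: ${name}`);
  const args = tool.validate(proposedArgs);
  if (args === null) throw new Error(`schema rejected arguments for ${name}`);
  return tool.run(args);
}

console.log(callTool("searchContacts", { query: "services rollup PE" }));
// searched for "services rollup PE"
```

A request for a tool that is not in the catalogue, such as a hypothetical `generateEmail`, fails at lookup rather than reaching any code path that could write data.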

C.5 Schema-bound outputs and idempotency

Structured model outputs (pitch-deck summaries, contact insights) use a schema as the response format. The output shape is fixed; the model cannot add unknown fields, and if it fails the schema the call fails loudly rather than silently writing a garbage row.
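A fixed output shape that rejects unknown fields and fails loudly might look like the sketch below. The `ContactInsight` fields are invented for illustration, and the hand-rolled checks stand in for whatever schema mechanism the platform actually uses.

```typescript
// Hypothetical output shape for a contact insight.
interface ContactInsight {
  summary: string;
  fitScore: number;
}

const ALLOWED_KEYS = ["summary", "fitScore"];

// Parse model output strictly: any deviation throws instead of writing a partial row.
function parseInsight(modelOutput: string): ContactInsight {
  const parsed: unknown = JSON.parse(modelOutput); // invalid JSON throws here
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("insight must be a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const unknownKeys = Object.keys(record).filter((k) => !ALLOWED_KEYS.includes(k));
  if (unknownKeys.length > 0) {
    throw new Error(`unknown fields: ${unknownKeys.join(", ")}`);
  }
  const summary = record["summary"];
  const fitScore = record["fitScore"];
  if (typeof summary !== "string") throw new Error("summary must be a string");
  if (typeof fitScore !== "number" || fitScore < 0 || fitScore > 100) {
    throw new Error("fitScore must be a number from 0 to 100");
  }
  return { summary, fitScore };
}

console.log(parseInsight('{"summary":"Active acquirer in adjacent services","fitScore":82}'));
```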

Running the same operation twice does not duplicate rows. Race-safe upserts on global organisations and people (unique normalised-name index, compound lookups on LinkedIn URL, email, name). Bring-in is idempotent on (workspaceId, globalContactId). CSV import surfaces specific row conflicts rather than silently double-writing.
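Idempotency on the (workspaceId, globalContactId) pair can be shown with an in-memory stand-in for a unique index. The class and method names are hypothetical, and a real implementation would rely on a database constraint rather than a `Map`.

```typescript
interface WorkspaceContact {
  workspaceId: string;
  globalContactId: string;
}

class WorkspaceContactStore {
  // Keyed on the pair, mimicking a unique index on (workspaceId, globalContactId).
  private readonly rows = new Map<string, WorkspaceContact>();

  // Returns the existing row when the pair is already present, so repeats are no-ops.
  bringIn(workspaceId: string, globalContactId: string): WorkspaceContact {
    const key = JSON.stringify([workspaceId, globalContactId]);
    const existing = this.rows.get(key);
    if (existing !== undefined) return existing;
    const row: WorkspaceContact = { workspaceId, globalContactId };
    this.rows.set(key, row);
    return row;
  }

  get size(): number {
    return this.rows.size;
  }
}

const store = new WorkspaceContactStore();
const firstWrite = store.bringIn("ws-a", "gc_1");
const repeatWrite = store.bringIn("ws-a", "gc_1"); // same pair: no new row
store.bringIn("ws-b", "gc_1");                     // different workspace: new row

console.log(store.size);                 // 2
console.log(firstWrite === repeatWrite); // true
```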

C.6 What determinism does not mean

What we guarantee, plainly: