Iceberg is the execution intelligence layer for sell-side M&A and capital raise advisers. The platform compresses the four days a senior analyst spends building a buyer universe into a four-hour automated run, while preserving the qualitative bar a hand-built universe sets.
Iceberg builds a fresh buyer universe per mandate using deal-led discovery across an internal historical deals dataset, industry directories, regional registries, trade press, association lists, and company sites. The system surfaces the majority of investors and buyers PitchBook and CapIQ cover, plus a long tail those platforms miss entirely. Each mandate runs multiple buyer angles in parallel: strategic operators, services-rollup PE, corporate acquirers, family offices, and adjacent VC where relevant. Every contact carries a fit score, a sourced rationale, and a verified contact path. All generated per mandate, not pulled from a static database.
The implication for evaluators is structural. This is not a model API wrapped in an interface. It is a practitioner's playbook made operational, validated across 32 live mandates internally before it ever met an external user.
This report benchmarks Iceberg's V2 contact sourcing pipeline against a hand-built buyer universe produced by a boutique sell-side adviser on a representative trade-sale mandate, drawn from internal benchmark work across multiple live mandates. The goal: quantify, on real exercises, how close an AI-driven sourcing run gets to adviser-grade output, and at what cost in time and money.
Benchmark: 4 days of senior adviser and analyst effort produced 112 candidate companies, 80 tier-ranked (16 Tier 1, balance Tier 2 and below), 10 actionable (tier-ranked plus verified email).
Test: blind run of Iceberg V2 on the same mandate, zero input from the adviser's existing work, 4 hours of unattended runtime.
Result: 93 contacts across two parallel buyer angles. 67 with verified work emails, 95% on-thesis. 11 of the adviser's named contacts surfaced independently, including 5 of his 16 Tier 1 picks. Plus 33 services-rollup PE firms with 47 partner-level contacts the adviser's list did not contain.
Floor, not ceiling. These numbers came from a single-shot V2 cold-start: no prior platform learning, no warm-path scoring, no master contact graph populated across workspaces, no adviser prompts or context, and no iterative refinement. Every mandate on the platform is designed to be re-run multiple times across its lifecycle, with progressively less input each pass. Thin feedback after each delivery (kept, rejected, asked-for-more) compounds into a sharper next run. This report measures the baseline before any of those compounding mechanisms kick in.
For procurement and security reviewers: Annexes A through C and the "Why Not DIY" section cover data flows, sub-processors, hosting topology, and the deterministic ingest path that prevents the model from generating contact details.
Inputs: company name, sector, geography, revenue, EBITDA margin. Zero exposure to the adviser's tier list, rationales, or contact picks. No prior contacts in the workspace, no CIM, no seed list.
Method: two parallel sourcing paths (strategic operators, services-rollup PE), per-contact strategic reasoning attached to every name, 4 hours unattended.
Output: 93 contacts across 59 unique organisations (46 strategic operator contacts at 26 firms, 47 PE partner contacts at 33 firms). 67 contacts with verified work emails (72% overall, 78% on the PE side). 95% on-thesis for the trade-sale brief, no VC or UHNW noise.
Re-running the same mandate after 20 adviser mandates have moved through the platform, with 15 minutes of upfront context, or with one round of feedback on this first pass, would meaningfully exceed these numbers.
| Metric | Adviser hand-build | Old Iceberg (rejected) | Iceberg V2 |
|---|---|---|---|
| Total contacts | 112 (80 tiered) | 100 | 93 |
| With verified work email | 30 (27%) | 99 (99%, wrong people) | 67 (72%) |
| Actionable contacts | 10 | 0 | 67 |
| On-thesis fit | High | ~10-20% (VC/UHNW noise) | ~95% (clean) |
| Per-contact rationale | Yes (paragraph) | No (generic tag) | Yes (scored, sourced) |
| PE-side coverage | 13 PE-backed targets, 0 PE contacts | None | 33 PE firms, 47 partners |
| Person-level rediscovery (blind) | n/a | 0 | 11 of adviser's named contacts |
| Wall time | 4 days | ~2 hours | 4 hours |
| Analyst hours consumed | ~32 hours | n/a | 0 hours |
11 of the adviser's hand-picked named contacts surfaced in the blind run, the exact individuals he had identified as the right decision-makers after his own 4 days of research. Of these, 5 are among his 16 Tier 1 picks.
This is a contact-selection signal, not just a sourcing signal. The pipeline's reasoning about who the right person is at each strategic buyer converged with a senior adviser's manual pick 11 times, blind.
The adviser's list flagged 13 PE-backed targets but contained zero PE-side contacts. Iceberg V2 produced 33 services-rollup PE firms with 47 partner-level contacts, 78% with verified emails. The PE backer the adviser had flagged inline as the existing investor in his top Tier 1 target surfaced independently. Two further PE firms he had referenced as prior backers of his tier-ranked names also appeared organically. The remaining firms are genuine universe expansion in the buyer category most likely to actually close a trade sale at premium multiples.
Where Iceberg's previous output collapsed contacts into generic 'Strategic / UHNW' tags and surfaced VC and angel investors the adviser explicitly rejected, V2 segments cleanly into strategic operators and services-rollup PE. Each contact carries a scored strategic reason and a portfolio_companies field for context. This is the structural fix to the core complaint that ended the original pilot.
5 of the adviser's 16 Tier 1 picks surfaced at the contact level in the blind run. Several missed targets appear in PE firm portfolios elsewhere in the same output but did not surface independently in the operator-side path.
How this gap closes: three compounding mechanisms tighten Tier 1 hit rate from the V2 baseline. First, the intended workflow has the adviser upload the CIM, key contacts, and any existing target list at mandate setup. The blind run benchmarks the cold-start floor, not the production ceiling. Second, the master contact graph accumulates across every mandate run on the platform, so each new adviser tightens the coverage map for every other adviser working in adjacent verticals. Third, every mandate is designed to be re-run with progressively lighter input. Thin feedback after each delivery (kept, rejected, asked-for-more) compounds into a sharper next pass, with the same mandate typically run three to five times across its lifecycle. With CIM seeding, modest platform tenure, and one round of iterative refinement, Tier 1 hit rate closes toward 90%+ in production.
Overall verified-email coverage of 72% is acceptable on a cold run. Founder and CEO emails at privately-held offshore-services businesses are the hardest pattern to enrich. Coverage on the services-rollup PE side ran at 78%.
The core economic case for Iceberg-assisted sourcing on a single mandate:
| Dimension | Adviser hand-build | Iceberg V2 | Ratio |
|---|---|---|---|
| Wall time | 4 days | 4 hours | 12.5% (4 of ~32 working hours) |
| Senior analyst hours | ~32 hours | 0 hours | n/a |
| Actionable contacts produced | 10 | 67 | 6.7x |
| Cost per mandate | AUD 4,800 to 8,000 | $500 per 150 contacts | ~5% of adviser cost |
| Effective analyst throughput | 0.31 contacts/hour | Unattended | ~50x at this benchmark |
Iceberg replaces the 4-day analyst grind with a 4-hour unattended run that delivers 6.7x the actionable volume at roughly 5% of the adviser cost base, while preserving ~75% of the qualitative bar a hand-built universe sets. Closing the remaining 25% takes a 20-minute upload of the adviser's CIM and seed contacts, the intended production workflow.
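The ratios in the table above follow from the benchmark figures by simple arithmetic. The sketch below reproduces them in TypeScript, assuming an 8-hour working day (the assumption under which 4 days equals ~32 analyst hours). The variable names are illustrative, and the throughput multiple shown is one plausible way to arrive at the reported ~50x, not a confirmed derivation.

```typescript
// Benchmark figures taken from the tables above.
const adviserHours = 4 * 8;       // 4 days at an assumed 8-hour working day
const icebergHours = 4;           // unattended runtime
const adviserActionable = 10;     // tier-ranked plus verified email
const icebergActionable = 67;     // contacts with verified work emails

// Iceberg runtime as a share of adviser working hours.
const wallTimeRatio = icebergHours / adviserHours;

// Multiple of actionable contacts produced.
const volumeMultiple = icebergActionable / adviserActionable;

// Actionable contacts per hour on each side, and the resulting multiple.
const adviserThroughput = adviserActionable / adviserHours;
const icebergThroughput = icebergActionable / icebergHours;
const throughputMultiple = icebergThroughput / adviserThroughput;

console.log((wallTimeRatio * 100).toFixed(1) + "%"); // 12.5%
console.log(volumeMultiple.toFixed(1) + "x");        // 6.7x
console.log(adviserThroughput.toFixed(2));           // 0.31
console.log(Math.round(throughputMultiple) + "x");   // 54x, consistent with "~50x"
```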
The natural question for any well-resourced firm is whether an analyst with a PitchBook seat and an LLM subscription could produce the same output. The answer is no, and the reason is not the LLM. It is the infrastructure underneath: deal-led discovery, multi-provider verification, persistent memory, and the audit trail required to put an adviser's name behind the contact.
| Dimension | DIY (Claude + PitchBook or Grata) | Iceberg |
|---|---|---|
| Universe construction | Static database filters, analyst prompts the LLM over CSV exports | Deal-led discovery across registries, trade press, association lists, and historical-deal datasets, run as parallel buyer-angle searches per mandate |
| Contact verification | Manual, varies by analyst and chosen enrichment tools | Multi-provider verification waterfall, emails and phones are never generated by the model |
| Memory across mandates | None, every deal starts cold | Master contact graph keyed by email and LinkedIn URL, cross-mandate signals carry forward |
| Analyst time per mandate | 30 to 40 hours for prompting, deduping, enriching, and CRM formatting | Zero hours to run, 4-hour unattended pipeline straight to CRM |
| Per-contact audit trail | None | Source, rationale, and score recorded against every contact |
| Governance posture | Ad hoc, varies by analyst and project | Workspace-scoped tenancy, Sydney-region inference, contractual no-train terms with the model provider |
DIY assembles each of these as separate workflows, glued together by an analyst. Iceberg ships them as one system, with verification, audit, and governance built in by default. The cost difference is not in the LLM bill. It is in the analyst hours saved, the verification rework avoided, and the procurement comfort that comes from contact data with full provenance.
Every component below exists to surface buyers and investors an adviser could not realistically find in a 4-day analyst hunt. The system is built around deal-led discovery, multi-angle orchestration, multi-provider verification, and explicit score-and-rationale filtering before contacts reach the adviser.
Each mandate is decomposed into the buyer categories most likely to close it at premium multiples. A trade sale might run strategic-operator and services-rollup-PE discovery in parallel; a capital raise might run sector-thesis VC, family office, and growth-PE paths simultaneously. The architecture also supports multiple mandates running in parallel on the platform at any given time. A firm with twenty live mandates can have all twenty universes building simultaneously, each tuned to its own deal, with no contention across runs.
For each mandate, the system searches for organisations that have done similar deals in the recent past, drawing on an internal historical deals dataset plus searches across industry directories, regional registries, trade-press coverage, association lists, and company websites. For investor mandates, the search walks from those companies back to the investors who wrote the cheques.
Coverage vs PitchBook or CapIQ: the majority of named deals and investors those platforms cover, plus a long tail they miss entirely. The owner of a forklift rental business does not sit in PitchBook; a deal-led search finds them.
Refresh cadence is per-mandate, not scheduled. Each new brief triggers a fresh search rather than pulling from a static database. Contacts older than 90 days are re-enriched against live providers before delivery. Every email, phone, LinkedIn URL, and employment entry comes from a contact-data provider with a deliverability status attached. There is no facility for the AI to hallucinate a contact detail.
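The 90-day re-enrichment rule can be expressed as a simple staleness check. This is a minimal sketch only: the `EnrichedContact` shape, the `lastEnrichedAt` field, and the function name are hypothetical, and treating exactly 90 days as still fresh is an assumption.

```typescript
// Hypothetical record shape: only the fields needed for the staleness check.
interface EnrichedContact {
  email: string;
  lastEnrichedAt: Date;
}

const MAX_AGE_DAYS = 90; // threshold stated in the text
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the contact was last enriched more than 90 days before `now`.
function needsReEnrichment(contact: EnrichedContact, now: Date): boolean {
  const ageDays = (now.getTime() - contact.lastEnrichedAt.getTime()) / MS_PER_DAY;
  return ageDays > MAX_AGE_DAYS;
}

const now = new Date("2025-06-30T00:00:00Z");
const fresh: EnrichedContact = { email: "a@example.com", lastEnrichedAt: new Date("2025-06-01T00:00:00Z") };
const stale: EnrichedContact = { email: "b@example.com", lastEnrichedAt: new Date("2025-01-15T00:00:00Z") };

console.log(needsReEnrichment(fresh, now)); // false
console.log(needsReEnrichment(stale, now)); // true
```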
Each mandate runs through five stages: organisation discovery anchored on adjacent-deal companies, then people discovery by title/role, then multi-provider enrichment (work email, phone, LinkedIn, employment history), then model scoring (0-100 against the mandate rubric, with rationale), then filter and deliver. Contacts below a hard score floor of 50 are not surfaced. Above the floor, all contacts up to the requested cap are delivered with score and rationale visible.
Adviser-side dedup is handled by passing existing contacts to the API as a suppression list, so prior outreach and known names are filtered out before delivery. The model scores over structured records that have already come back from a verification step. It never produces a contact detail itself.
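The final filter-and-deliver stage, combined with the suppression list, could look roughly like the sketch below. The record shape and function names are hypothetical. Matching suppressed contacts on email alone, applying suppression before the score floor, and ranking by score before applying the cap are all assumptions made for illustration.

```typescript
// Hypothetical shape of a contact after enrichment and model scoring.
interface ScoredContact {
  name: string;
  email: string;
  score: number;     // 0-100 against the mandate rubric
  rationale: string;
}

const SCORE_FLOOR = 50; // hard floor stated in the text

// Emails are compared case-insensitively; matching on email alone is an assumption.
function emailKey(email: string): string {
  return email.trim().toLowerCase();
}

// Apply suppression, then the score floor, then rank and truncate to the cap.
function filterAndDeliver(
  scored: ScoredContact[],
  suppressedEmails: string[],
  cap: number,
): ScoredContact[] {
  const suppressed = new Set(suppressedEmails.map(emailKey));
  return scored
    .filter((c) => !suppressed.has(emailKey(c.email)))
    .filter((c) => c.score >= SCORE_FLOOR)
    .sort((a, b) => b.score - a.score)
    .slice(0, cap);
}

const delivered = filterAndDeliver(
  [
    { name: "A", email: "a@example.com", score: 82, rationale: "adjacent acquirer" },
    { name: "B", email: "B@Example.com", score: 91, rationale: "already contacted" },
    { name: "C", email: "c@example.com", score: 49, rationale: "weak fit" },
    { name: "D", email: "d@example.com", score: 50, rationale: "at the floor" },
    { name: "E", email: "e@example.com", score: 67, rationale: "rollup platform" },
  ],
  ["b@example.com"],
  2,
);
console.log(delivered.map((c) => c.name)); // [ 'A', 'E' ]
```

In this example B is suppressed despite the highest score, C falls below the floor, and D survives the floor but is cut by the cap of 2.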
Procurement reviewers ask, correctly, exactly what data leaves the adviser's control and which third parties touch it. The complete answer is below; further detail in Annexes A and B.
Required: buyer type, geography list, natural-language brief on the ideal investor or buyer (20+ chars), and a mandate-context object describing the company being transacted. Optional: structured filters (sector, stage, transaction type), title filters, organisation criteria, suppression lists for known organisations or contacts, seed contacts, advisor-context free text, and a delivery cap (default 30, max 150).
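For a trial integration, the required fields and limits above can be checked before a request is sent. The sketch below is illustrative only: the field names (`buyerType`, `geographies`, `brief`, `mandateContext`, `deliveryCap`) are hypothetical and not the real API contract, and the minimum cap of 1 is an assumption. Only the 20-character brief minimum, the default of 30, and the maximum of 150 come from the text.

```typescript
// Hypothetical request shape; field names are illustrative, not the real API.
interface SourcingRequest {
  buyerType: string;
  geographies: string[];
  brief: string;
  mandateContext: Record<string, unknown>;
  deliveryCap?: number;
}

const DEFAULT_CAP = 30;
const MAX_CAP = 150;
const MIN_BRIEF_CHARS = 20;

// Returns a list of validation errors; an empty list means the request is acceptable.
function validateRequest(req: SourcingRequest): string[] {
  const errors: string[] = [];
  if (req.buyerType.trim() === "") errors.push("buyerType is required");
  if (req.geographies.length === 0) errors.push("at least one geography is required");
  if (req.brief.trim().length < MIN_BRIEF_CHARS) errors.push("brief must be at least 20 characters");
  if (Object.keys(req.mandateContext).length === 0) errors.push("mandateContext is required");
  const cap = req.deliveryCap ?? DEFAULT_CAP; // fall back to the documented default
  if (!Number.isInteger(cap) || cap < 1 || cap > MAX_CAP) {
    errors.push("deliveryCap must be an integer from 1 to 150");
  }
  return errors;
}

const errors = validateRequest({
  buyerType: "strategic",
  geographies: ["AU"],
  brief: "Too short",
  mandateContext: { sector: "business services" },
  deliveryCap: 200,
});
console.log(errors.length); // 2
```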
Iceberg can anonymise the payload, passing a generic company name and a sanitised company description, and still run a successful deep search. Deal characteristics (sector, geography, transaction type, target profile) are what drive sourcing.
The mandate brief and per-contact structured payload are sent to an enterprise LLM provider for classification, query refinement, and contact scoring. The model operates under enterprise data-processing terms: customer inputs are not used to train the model (contractual), inference is region-pinned to Sydney, and data is protected by TLS in transit and AES-256 at rest. The model sees the structured payload only, never the verified email address or phone number.
Main platform (adviser-facing CRM, contact store, AI agent layer, all customer data) sits in AWS ap-southeast-2 (Sydney). Inference runs through an enterprise LLM in the same Sydney region. Some legacy components of the sourcing pipeline currently sit on overseas infrastructure; full migration to onshore residency is in progress and scheduled ahead of any procurement-driven rollout. Detail in Annexe B.
Audience: engineering lead, infosec architect, or CTO doing diligence. Goal: enough detail to trust the stack and run a trial. Deliberately silent on the matching, ranking, and agent-orchestration logic that makes the product differentiated; available under NDA.
| Layer | Tech |
|---|---|
| Frontend | Next.js 14 (App Router), NextAuth, React Query, Tailwind |
| API | NestJS 10, TypeScript, class-validator. 26 feature modules. JWT global guard, role and workspace decorators, OpenAPI exposed at /api. |
| Persistence | PostgreSQL 16 via TypeORM. Single primary, 28 entities, ~120 versioned migrations. synchronize: false; schema only changes via migrations. |
| Cache / queues | Redis 7 + BullMQ async workers (email sync, sequence sends, follow-ups, embeddings, sourcing polling). |
| Object storage | AWS S3. Access brokered by the API. |
| AI orchestration | Agent runtime hosted inside the API process. Tools call the same workspace-scoped repositories the REST API uses; no shadow data path. Traces export to a managed observability platform. |
| Sourcing client | External deep-search API client. Results land in dedicated tables and are promoted into workspace contacts under adviser review. |
| Email | Resend (transactional), Gmail / Outlook OAuth (user-scoped). Outbound and inbound both go through the API. |
The AI orchestration layer is hosted inside the NestJS process. One process to log, throttle, redact, and observe. AI tools call the same workspace-scoped repositories the REST API uses, so there is no shadow data path. Conversation memory persists in a dedicated schema in the same DB with the same backup, encryption, and region. Model traces (model, tool calls, latency, token counts) export to managed observability.
Model calls go through an enterprise LLM endpoint pinned to australia-southeast1 (Sydney). The choice of enterprise endpoint (rather than the consumer API) is specifically for the data-handling contract: prompt and response data not used to train foundation models, region-pinned, TLS in transit, enterprise DPA covers data processing. The request leaves the AWS VPC for a managed inference endpoint but never leaves the Sydney metro region.
Agent tools split into retrieval tools (read from our DB, the sourcing API, or live web search; returns include citations) and mutation tools (write through the same NestJS services as the REST API, so all the same validation, normalisation, and dedup apply).
The contact-to-buyer matching model, the agent orchestration logic, and the scoring weights are not described here. Available under NDA.
Audience: procurement or InfoSec reviewer working through a checklist. Each section names the control and the implementation today; code references and IaC walkthroughs available on request.
| Topic | Status |
|---|---|
| Data sovereignty | Customer data tier in AWS ap-southeast-2 (Sydney). LLM inference pinned to Sydney. Some legacy sourcing components currently offshore; full migration to onshore in progress. |
| Hosting | AWS (ECS Fargate + RDS + ElastiCache + S3). IaC via Pulumi. |
| Authentication | NestJS JWT global guard, bcrypt password hashing, NextAuth on the frontend. |
| Multi-tenancy | Workspace-scoped data model enforced in code on every query. |
| Authorisation | Role decorators on every controller. |
| TLS | TLS 1.2+ at the ALB, ACM certificates, HTTP-to-HTTPS 301. |
| Encryption at rest | RDS, ElastiCache, S3 all encrypted with AWS-managed KMS keys. |
| Network isolation | App + data tier in private subnets. Ingress only via ALB. |
| Secrets | SSM Parameter Store SecureString (KMS). No secrets in source control. |
| Logging | ECS to CloudWatch; exception capture to PostHog. |
| Backups | Automated RDS snapshots. Redis snapshotting on production. |
| Source control | Private GitHub repo. CI deploys via GitHub Actions. |
Result: an anonymous user cannot reach any controller unless it is explicitly marked @Public(), and a user cannot read workspace data unless they are a member of that workspace.
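The access rule can be illustrated as a single decision function. This is a framework-free sketch, not the NestJS guard itself: `RouteMeta`, `User`, and `authorise` are hypothetical names, and the real implementation uses the JWT global guard plus role and workspace decorators listed above.

```typescript
// Hypothetical metadata a route would carry (e.g. set by decorators).
interface RouteMeta {
  isPublic: boolean;
  workspaceId?: string; // present when the route touches workspace data
}

// Hypothetical authenticated principal.
interface User {
  id: string;
  workspaceIds: string[];
}

type Decision = "allow" | "unauthenticated" | "forbidden";

// Deny by default: only explicitly public routes skip authentication,
// and workspace data requires membership of that workspace.
function authorise(route: RouteMeta, user: User | null): Decision {
  if (route.isPublic) return "allow";
  if (user === null) return "unauthenticated";
  if (route.workspaceId !== undefined && user.workspaceIds.indexOf(route.workspaceId) === -1) {
    return "forbidden";
  }
  return "allow";
}

const member: User = { id: "u1", workspaceIds: ["ws-1"] };
console.log(authorise({ isPublic: true }, null));                          // allow
console.log(authorise({ isPublic: false, workspaceId: "ws-1" }, null));    // unauthenticated
console.log(authorise({ isPublic: false, workspaceId: "ws-2" }, member));  // forbidden
console.log(authorise({ isPublic: false, workspaceId: "ws-1" }, member));  // allow
```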
These third parties receive request payloads or store user data on Iceberg's behalf. All are reachable from the API only, behind authenticated tools. No client-side calls.
| Vendor | Data flow | Purpose |
|---|---|---|
| AWS (ap-southeast-2) | All customer data | Primary hosting (compute, DB, cache, storage, secrets) |
| Cloudflare | DNS + edge TLS | DNS and CDN |
| Enterprise LLM (Sydney region) | Prompt text including excerpts of adviser-supplied content; responses returned to API | Inference. Enterprise terms: prompt and response data not used to train foundation models. Region-pinned. TLS in transit, AES-256 at rest. Enterprise DPA covers data processing. |
| Sourcing provider (deep-search API) | Mandate parameters; receives back investor records | Investor sourcing pipeline |
| Web search provider | Search queries derived from adviser intent | Real-time web search citations |
| Resend | Outbound email content + recipient address | Transactional + system email delivery |
| Gmail / Outlook (per-user OAuth) | User's own inbox under user-granted scopes | Optional inbox sync for the adviser |
| Observability platform | LLM request/response traces, token counts | Observability of AI calls |
| PostHog | Product analytics + exception payloads | Analytics + error monitoring |
Formal sub-processor disclosure with vendor security pages and DPAs available on request.
Surfaced transparently because procurement will ask:
The question that comes up at every meeting: "How do you stop the model making up emails and phone numbers?"
Short answer: the model is not asked to. Contact data flows through a deterministic ingest pipeline (external sourcing API, CSV import, or user-entered fields) and lands in the database via the same validation and dedup helpers in every case. The LLM can decide which contact to act on, but it cannot author an email address or phone number into the system.
| Path | Source of truth | Determinism mechanism |
|---|---|---|
| Sourcing run | External deep-search API (mandate-driven) | API returns structured rows. Upsert into GlobalContact, then promoted into the workspace as Contact. |
| CSV / spreadsheet import | The adviser's file | Parsed row by row. Values run through normalisers before insert. |
| Manual entry | Adviser types into the CRM form | Same DTOs, same validators, same normalisers as import. |
There is no fourth path. The AI does not invent contact records. The agents call the same workspace-scoped services the REST API uses, with the same schemas and the same guards.
Every write of an email or phone goes through the same two helpers:
In parallel, ~25 enum-like normalisers (geography, investor type, check size, priority, etc.) are contract-bound to return null on no-match rather than throw. A bad row never poisons a batch import; the caller decides whether to default or reject.
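The return-null-on-no-match contract can be sketched as follows. The `InvestorType` values and the alias table are hypothetical, since the real normaliser vocabulary is not described here. The point is the shape of the contract: the normaliser never throws, and the caller decides what a null means.

```typescript
type InvestorType = "PE" | "VC" | "FAMILY_OFFICE" | "STRATEGIC";

// Hypothetical alias table; the real vocabulary is an assumption.
const INVESTOR_TYPE_ALIASES = new Map<string, InvestorType>([
  ["pe", "PE"],
  ["private equity", "PE"],
  ["vc", "VC"],
  ["venture capital", "VC"],
  ["family office", "FAMILY_OFFICE"],
  ["strategic", "STRATEGIC"],
]);

// Contract: return null on no-match, never throw.
function normaliseInvestorType(raw: unknown): InvestorType | null {
  if (typeof raw !== "string") return null;
  const key = raw.trim().toLowerCase();
  return INVESTOR_TYPE_ALIASES.get(key) ?? null;
}

// The caller decides whether a null becomes a default or a rejected row.
const rows = ["Private Equity", "hedge fund", 42];
const parsed = rows.map((r) => normaliseInvestorType(r) ?? "UNKNOWN");
console.log(parsed); // [ 'PE', 'UNKNOWN', 'UNKNOWN' ]
```

Because one unrecognised value yields null instead of an exception, the rest of a batch import can proceed while the bad row is defaulted or rejected on its own.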
Each Contact records how it got there: source (USER, SOURCED, or GLOBAL), sourceGlobalContactId back-reference, lastGlobalSyncAt timestamp, and per-deal aiRanking + aiReasoning. Signals (news mentions, events) carry their source URL. The canonical GlobalContact stores the raw upstream payload so we can re-parse later without going back to the provider. A reviewer can answer 'where did this email come from?' by inspecting the chain. No audit dead-ends.
Agents are wired to a fixed catalogue of tools. Each tool has a schema-validated input. The model can choose a tool and propose arguments; the schema validates before the tool body runs, and the tool body calls back into the same NestJS service the REST API uses.
| Tool family | Can the model invent contact PII? |
|---|---|
| Retrieval (research, sourcing, web search) | No. Pulls from external providers or our DB. Returns include citations and source URLs. Extraction temperature 0.2 (near-deterministic). |
| Mutation (create, update) | The model can propose values, but they flow through the same DTO + validator + dual-write path as a manual form submission. Malformed values dropped, duplicates collapsed. |
| PII generation | No tool with this purpose exists in the catalogue. |
Agent prompts also enforce the contract: research first, present, confirm, then write. A row without confirmed provenance does not get written.
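The fixed-catalogue pattern described above can be sketched in a few lines. Everything here is hypothetical (the `defineTool` helper, the `searchContacts` tool, and the hand-written validator standing in for a schema library). It shows the ordering only: the model proposes a tool name and arguments, the arguments are validated, and only then does the tool body run.

```typescript
// A tool as the agent runtime sees it: a name and a guarded entry point.
interface Tool {
  name: string;
  invoke: (proposedArgs: unknown) => string;
}

// Wraps a tool body so the validator always runs first.
function defineTool<A>(
  name: string,
  validate: (args: unknown) => A | null, // null means the arguments failed the schema
  run: (args: A) => string,
): Tool {
  return {
    name,
    invoke: (proposedArgs) => {
      const args = validate(proposedArgs);
      if (args === null) throw new Error(`invalid arguments for ${name}`);
      return run(args);
    },
  };
}

interface SearchArgs { query: string; limit: number; }

const searchContacts = defineTool<SearchArgs>(
  "searchContacts",
  (args) => {
    if (typeof args !== "object" || args === null) return null;
    const { query, limit } = args as { query?: unknown; limit?: unknown };
    if (typeof query !== "string" || query.length === 0) return null;
    if (typeof limit !== "number" || !Number.isInteger(limit) || limit < 1) return null;
    return { query, limit };
  },
  (args) => `searched "${args.query}" (limit ${args.limit})`,
);

// Fixed catalogue: the model can only name a tool that is registered here.
const catalogue = new Map<string, Tool>([[searchContacts.name, searchContacts]]);

function dispatch(toolName: string, proposedArgs: unknown): string {
  const tool = catalogue.get(toolName);
  if (!tool) throw new Error(`unknown tool: ${toolName}`);
  return tool.invoke(proposedArgs);
}

console.log(dispatch("searchContacts", { query: "forklift rental", limit: 5 }));
// searched "forklift rental" (limit 5)
```

A call to a tool that is not registered, such as a hypothetical `generateEmail`, fails at dispatch, which is the sense in which no PII-generation tool exists in the catalogue.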
Structured model outputs (pitch-deck summaries, contact insights) use a schema as the response format. The output shape is fixed; the model cannot add unknown fields, and if it fails the schema the call fails loudly rather than silently writing a garbage row.
Running the same operation twice does not duplicate rows. Race-safe upserts on global organisations and people (unique normalised-name index, compound lookups on LinkedIn URL, email, name). Bring-in is idempotent on (workspaceId, globalContactId). CSV import surfaces specific row conflicts rather than silently double-writing.
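Idempotent bring-in on (workspaceId, globalContactId) can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: the `WorkspaceContactStore` class and `bringIn` method are hypothetical, and a Map keyed on the pair replaces what would in production be a database uniqueness constraint and upsert.

```typescript
interface WorkspaceContact {
  workspaceId: string;
  globalContactId: string;
  broughtInAt: Date;
}

// In-memory stand-in for a table with a unique key on (workspaceId, globalContactId).
class WorkspaceContactStore {
  private rows = new Map<string, WorkspaceContact>();

  // Idempotent: a repeat call returns the existing row instead of inserting a duplicate.
  bringIn(workspaceId: string, globalContactId: string): { row: WorkspaceContact; created: boolean } {
    const key = JSON.stringify([workspaceId, globalContactId]);
    const existing = this.rows.get(key);
    if (existing) return { row: existing, created: false };
    const row: WorkspaceContact = { workspaceId, globalContactId, broughtInAt: new Date() };
    this.rows.set(key, row);
    return { row, created: true };
  }

  get size(): number {
    return this.rows.size;
  }
}

const store = new WorkspaceContactStore();
const first = store.bringIn("ws-1", "gc-42");
const second = store.bringIn("ws-1", "gc-42");
console.log(first.created, second.created, store.size); // true false 1
```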
What we guarantee, plainly: