ICEBERG

Sourcing Effectiveness Report

Adviser-grade output,
at a fraction of the cost.

Benchmarking AI-generated buyer universes against adviser-grade hand-builds

Representative example

Sell-side trade-sale mandate, services sector

Benchmark adviser

Boutique sell-side firm, 4-day hand-built buyer universe

Benchmark output

112 candidates, 80 tier-ranked, 10 actionable contacts

Iceberg in Brief

Iceberg is the execution intelligence layer for sell-side M&A and capital raise advisers. The platform compresses the four days a senior analyst spends building a buyer universe into a four-hour automated run, while preserving the qualitative bar a hand-built universe sets.

Iceberg builds a fresh buyer universe per mandate using deal-led discovery across an internal historical deals dataset, industry directories, regional registries, trade press, association lists, and company sites. The system surfaces the majority of investors and buyers PitchBook and CapIQ cover, plus a long tail those platforms miss entirely. Each mandate runs multiple buyer angles in parallel: strategic operators, services-rollup PE, corporate acquirers, family offices, and adjacent VC where relevant. Every contact carries a fit score, a sourced rationale, and a verified contact path. All generated per mandate, not pulled from a static database.

The implication for evaluators is structural. This is not a model API wrapped in an interface. It is a practitioner's playbook made operational, validated across 32 live mandates internally before it ever met an external user.

Under the hood

Defensibility, not just output

Every contact in the system can be traced to a non-AI origin: a sourcing-API row, a CSV cell, or a form field. The language model proposes, a deterministic ingest pipeline validates and writes. There is no path by which a hallucinated email or phone reaches an adviser. See Annexe C.

Executive Summary

This report benchmarks Iceberg's V2 contact sourcing pipeline against a hand-built buyer universe produced by a boutique sell-side adviser on a representative trade-sale mandate, drawn from internal benchmark work across multiple live mandates. The goal: quantify, on real exercises, how close an AI-driven sourcing run gets to adviser-grade output, and at what cost in time and money.

Benchmark: 4 days of senior adviser and analyst effort produced 112 candidate companies, 80 tier-ranked (16 Tier 1, balance Tier 2 and below), 10 actionable (tier-ranked plus verified email).

Test: blind run of Iceberg V2 on the same mandate, zero input from the adviser's existing work, 4 hours of unattended runtime.

Result: 93 contacts across two parallel buyer angles. 67 with verified work emails, 95% on-thesis. 11 of the adviser's named contacts surfaced independently, including 5 of his 16 Tier 1 picks. Plus 33 services-rollup PE firms with 47 partner-level contacts the adviser's list did not contain.

Floor, not ceiling. These numbers came from a single-shot V2 cold-start: no prior platform learning, no warm-path scoring, no master contact graph populated across workspaces, no adviser prompts or context, and no iterative refinement. Every mandate on the platform is designed to be re-run multiple times across its lifecycle, with progressively less input each pass. Thin feedback after each delivery (kept, rejected, asked-for-more) compounds into a sharper next run. This report measures the baseline before any of those compounding mechanisms kick in.

Headline

Iceberg V2 delivered ~75% of adviser-grade quality, 6.7x adviser-actionable contact volume, in 12.5% of the time. Roughly 50x throughput per analyst-hour at this benchmark.

For procurement and security reviewers: Annexes A through C and the "Why Not DIY" section cover data flows, sub-processors, hosting topology, and the deterministic ingest path that prevents the model from generating contact details.

Benchmark and Test

The adviser's hand-built universe (4 days)

112 candidate companies, with 24 columns of structured rationale per company
Tier system applied: 80 tier-ranked (16 Tier 1, balance Tier 2 and below), balance untiered
30 of 112 companies (27%) had a verified contact email captured
10 actionable contacts (tier-ranked plus contactable email)
Effort: ~32 senior analyst hours at AUD 150 to 250 per hour, AUD 4,800 to 8,000 of sunk time per mandate

Iceberg V2 blind run (4 hours)

Inputs: company name, sector, geography, revenue, EBITDA margin. Zero exposure to the adviser's tier list, rationales, or contact picks. No prior contacts in the workspace, no CIM, no seed list.

Method: two parallel sourcing paths (strategic operators, services-rollup PE), per-contact strategic reasoning attached to every name, 4 hours unattended.

Output: 93 contacts across 59 unique organisations (46 strategic operator contacts at 26 firms, 47 PE partner contacts at 33 firms). 67 contacts with verified work emails (72% overall, 78% on the PE side). 95% on-thesis for the trade-sale brief, no VC or UHNW noise.

V2 cold-start, compounding mechanisms not active

No prior platform learning from other adviser mandates feeding historical fit signals
No warm-path scoring from accumulated relationship and engagement data
No master contact graph (empty at the time of this run)
No adviser inputs during the run; the pipeline operated on mandate fundamentals alone
No iterative refinement. This was a single-shot run. Mandates are designed to be re-run three to five times across their lifecycle, with thin feedback on each delivery sharpening the next pass and progressively less input required each time.

Re-running the same mandate under any one of three conditions should meaningfully exceed these numbers: after 20 adviser mandates have moved through the platform, with 15 minutes of upfront context, or with one round of feedback on this first pass.

Comparison Matrix

Metric	Adviser hand-build	Old Iceberg (rejected)	Iceberg V2
Total contacts	112 (80 tiered)	100	93
With verified work email	30 (27%)	99 (99%, wrong people)	67 (72%)
Actionable contacts	10	0	67
On-thesis fit	High	~10-20% (VC/UHNW noise)	~95% (clean)
Per-contact rationale	Yes (paragraph)	No (generic tag)	Yes (scored, sourced)
PE-side coverage	13 PE-backed targets, 0 PE contacts	None	33 PE firms, 47 partners
Person-level rediscovery (blind)	n/a	0	11 of adviser's named contacts
Wall time	4 days	~2 hours	4 hours
Analyst hours consumed	~32 hours	n/a	0 hours

Where Iceberg V2 Hits the Adviser Bar

Person-level rediscovery, not just company-level

11 of the adviser's hand-picked named contacts surfaced in the blind run, the exact individuals he had identified as the right decision-makers after his own 4 days of research. Of these:

9 are at the adviser's tier-ranked targets (5 Tier 1, 4 Tier 2)
2 are at companies he had on his shortlist but did not tier
8 surfaced at the right current employer; 3 surfaced at a secondary affiliation (board seat, side venture, past role), which is easy to clean up before outreach

This is a contact-selection signal, not just a sourcing signal. The pipeline's reasoning about who the right person is at each strategic buyer converged with a senior adviser's manual pick 11 times, blind.

PE-side coverage the adviser did not build

The adviser's list flagged 13 PE-backed targets but contained zero PE-side contacts. Iceberg V2 produced 33 services-rollup PE firms with 47 partner-level contacts, 78% with verified emails. The PE backer the adviser had flagged inline as the existing investor in his top Tier 1 target surfaced independently. Two further PE firms he had referenced as prior backers of his tier-ranked names also appeared organically. The remaining firms are genuine universe expansion in the buyer category most likely to actually close a trade sale at premium multiples.

Buyer-type segmentation aligned to the brief

Where Iceberg's previous output collapsed contacts into generic 'Strategic / UHNW' tags and surfaced VC and angel investors the adviser explicitly rejected, V2 segments cleanly into strategic operators and services-rollup PE. Each contact carries a scored strategic reason and a portfolio_companies field for context. This is the structural fix to the core complaint that ended the original pilot.

Where Iceberg V2 Trails, and How the Gap Closes

Organic Tier 1 hit rate at 31%

5 of the adviser's 16 Tier 1 picks surfaced at the contact level in the blind run. Several missed targets appear in PE firm portfolios elsewhere in the same output but did not surface independently in the operator-side path.

How this gap closes: three compounding mechanisms tighten Tier 1 hit rate from the V2 baseline. First, the intended workflow has the adviser upload the CIM, key contacts, and any existing target list at mandate setup. The blind run benchmarks the cold-start floor, not the production ceiling. Second, the master contact graph accumulates across every mandate run on the platform, so each new adviser tightens the coverage map for every other adviser working in adjacent verticals. Third, every mandate is designed to be re-run with progressively lighter input. Thin feedback after each delivery (kept, rejected, asked-for-more) compounds into a sharper next pass, with the same mandate typically run three to five times across its lifecycle. With CIM seeding, modest platform tenure, and one round of iterative refinement, Tier 1 hit rate is projected to close toward 90%+ in production.

Email coverage on operator-side founders at 61%

Acceptable on a cold run. Founder and CEO emails at privately-held offshore-services businesses are the hardest pattern to enrich. Coverage on the services-rollup PE side ran at 78%.

Throughput and Cost Math

The core economic case for Iceberg-assisted sourcing on a single mandate:

Dimension	Adviser hand-build	Iceberg V2	Ratio
Wall time	4 days	4 hours	12.5%
Senior analyst hours	~32 hours	0 hours	n/a
Actionable contacts produced	10	67	6.7x
Cost per mandate	AUD 4,800 to 8,000	$500 per 150 contacts	~5% of adviser cost
Effective analyst throughput	0.31 contacts/hour	Unattended	~50x at this benchmark

Iceberg replaces the 4-day analyst grind with a 4-hour unattended run that delivers 6.7x the actionable volume at roughly 5% of the adviser cost base, while preserving ~75% of the qualitative bar a hand-built universe sets. Closing the remaining 25% takes a 20-minute upload of the adviser's CIM and seed contacts, the intended production workflow.
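The ratios in the table can be reproduced from the benchmark figures. The sketch below is illustrative arithmetic only. It assumes an 8-hour working day (so 4 days equals the ~32 analyst hours quoted) and reads the ~50x figure as actionable contacts per hour on each side, which is one interpretation the report does not spell out. All variable names are hypothetical.

```typescript
// Figures taken from the benchmark tables in this report.
const adviserHours = 32;        // ~4 working days of analyst effort
const adviserActionable = 10;   // tier-ranked plus verified email
const icebergWallHours = 4;     // unattended runtime
const icebergActionable = 67;   // contacts with verified work emails

// Wall time as a share of the adviser's effort (assumes 8-hour days).
const timeShare = icebergWallHours / adviserHours;          // 0.125, i.e. 12.5%

// Actionable contact volume relative to the hand-build.
const volumeRatio = icebergActionable / adviserActionable;  // 6.7

// Contacts per hour on each side, then the throughput multiple.
// Assumption: "throughput" compares adviser analyst-hours with Iceberg wall-clock hours.
const adviserRate = adviserActionable / adviserHours;       // 0.3125 contacts/hour
const icebergRate = icebergActionable / icebergWallHours;   // 16.75 contacts/hour
const throughputMultiple = icebergRate / adviserRate;       // about 53.6, quoted as ~50x

// Adviser cost range per mandate at AUD 150 to 250 per hour.
const adviserCostLow = adviserHours * 150;                  // 4800
const adviserCostHigh = adviserHours * 250;                 // 8000

const summary: string[] = [
  `${(timeShare * 100).toFixed(1)}% of the time`,
  `${volumeRatio.toFixed(1)}x actionable volume`,
  `${throughputMultiple.toFixed(1)}x contacts per hour`,
  `AUD ${adviserCostLow} to ${adviserCostHigh} adviser cost`,
];
```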

Why Not DIY: Claude Plus a Database

The natural question for any well-resourced firm is whether an analyst with a PitchBook seat and an LLM subscription could produce the same output. The answer is no, and the reason is not the LLM. It is the infrastructure underneath: deal-led discovery, multi-provider verification, persistent memory, and the audit trail required to put an adviser's name behind the contact.

Dimension	DIY (Claude + PitchBook or Grata)	Iceberg
Universe construction	Static database filters, analyst prompts the LLM over CSV exports	Deal-led discovery across registries, trade press, association lists, and historical-deal datasets, run as parallel buyer-angle searches per mandate
Contact verification	Manual, varies by analyst and chosen enrichment tools	Multi-provider verification waterfall, emails and phones are never generated by the model
Memory across mandates	None, every deal starts cold	Master contact graph keyed by email and LinkedIn URL, cross-mandate signals carry forward
Analyst time per mandate	30 to 40 hours for prompting, deduping, enriching, and CRM formatting	Zero hours to run, 4-hour unattended pipeline straight to CRM
Per-contact audit trail	None	Source, rationale, and score recorded against every contact
Governance posture	Ad hoc, varies by analyst and project	Workspace-scoped tenancy, Sydney-region inference, contractual no-train terms with the model provider

DIY assembles each of these as separate workflows, glued together by an analyst. Iceberg ships them as one system, with verification, audit, and governance built in by default. The cost difference is not in the LLM bill. It is in the analyst hours saved, the verification rework avoided, and the procurement comfort that comes from contact data with full provenance.

The Sourcing Engine

Every component below exists to surface buyers and investors an adviser could not realistically find in a 4-day analyst hunt. The system is built around deal-led discovery, multi-angle orchestration, multi-provider verification, and explicit score-and-rationale filtering before contacts reach the adviser.

Multiple angles per mandate, run in parallel

Each mandate is decomposed into the buyer categories most likely to close it at premium multiples. A trade sale might run strategic-operator and services-rollup-PE discovery in parallel; a capital raise might run sector-thesis VC, family office, and growth-PE paths simultaneously. The architecture also supports multiple mandates running in parallel on the platform at any given time. A firm with twenty live mandates can have all twenty universes building simultaneously, each tuned to its own deal, with no contention across runs.

Under the hood

Concurrency by design

Long-running sourcing work is dispatched to dedicated async queues. Twenty parallel mandates run as twenty independent jobs against the same workspace-scoped API. No code change required to scale horizontally. See Annexe A.

Data sources and refresh cadence

For each mandate, the system searches for organisations that have done similar deals in the recent past, drawing on an internal historical deals dataset plus searches across industry directories, regional registries, trade-press coverage, association lists, and company websites. For investor mandates, the search walks from those companies back to the investors who wrote the cheques.

Coverage vs PitchBook or CapIQ: the majority of named deals and investors those platforms cover, plus a long tail they miss entirely. The owner of a forklift rental business does not sit in PitchBook; a deal-led search finds them.

Refresh cadence is per-mandate, not scheduled. Each new brief triggers a fresh search rather than pulling from a static database. Contacts older than 90 days are re-enriched against live providers before delivery. Every email, phone, LinkedIn URL, and employment entry comes from a contact-data provider with a deliverability status attached. There is no facility for the AI to hallucinate a contact detail.
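The 90-day re-enrichment rule reduces to a simple age check. The following is a minimal sketch, assuming each contact stores the timestamp of its last enrichment; the function and constant names are hypothetical, and the real pipeline may measure age differently.

```typescript
// Threshold from the report: contacts older than 90 days are re-enriched.
const MAX_AGE_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true when the last enrichment is strictly more than 90 days old.
// Assumption: age is measured from a stored "last enriched" timestamp.
function needsReEnrichment(lastEnrichedAt: Date, now: Date): boolean {
  const ageMs = now.getTime() - lastEnrichedAt.getTime();
  return ageMs > MAX_AGE_DAYS * MS_PER_DAY;
}

// Usage: a contact enriched on 1 January is stale by 2 April of the same year.
const enrichedOn = new Date("2025-01-01T00:00:00Z");
const exactlyNinetyDays = needsReEnrichment(enrichedOn, new Date("2025-04-01T00:00:00Z")); // false
const ninetyOneDays = needsReEnrichment(enrichedOn, new Date("2025-04-02T00:00:00Z"));     // true
```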

Under the hood

Why a static database export plus a chat window does not match this

A PitchBook seat and an LLM subscription give an analyst an index and a prompt box. Deal-led discovery, the parallel multi-angle orchestration, the verification waterfall, and the workspace-scoped audit trail are separate pieces of infrastructure that have to exist before the output is procurement-grade. See "Why Not DIY" above for the full comparison.

Methodology, dedup, and confidence scoring

Each mandate runs through five stages: organisation discovery anchored on adjacent-deal companies, then people discovery by title/role, then multi-provider enrichment (work email, phone, LinkedIn, employment history), then model scoring (0-100 against the mandate rubric, with rationale), then filter and deliver. Contacts below a hard score floor of 50 are not surfaced. Above the floor, all contacts up to the requested cap are delivered with score and rationale visible.
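The final filter-and-deliver stage can be sketched as follows. This is an illustration under stated assumptions, not the production logic: the type and function names are hypothetical, and ordering by descending score before applying the cap is an assumption the report does not confirm.

```typescript
// A contact as it leaves the scoring stage.
interface ScoredContact {
  fullName: string;
  score: number;      // 0-100 against the mandate rubric
  rationale: string;  // delivered alongside the score
}

// Hard floor from the report: contacts scoring below 50 are not surfaced.
const SCORE_FLOOR = 50;

// Drop contacts under the floor, then deliver up to the requested cap.
// Assumption: highest scores are delivered first when the cap binds.
function filterAndDeliver(scored: ScoredContact[], cap: number): ScoredContact[] {
  return scored
    .filter((c) => c.score >= SCORE_FLOOR)
    .sort((a, b) => b.score - a.score)
    .slice(0, cap);
}

// Usage with made-up contacts.
const scoredSample: ScoredContact[] = [
  { fullName: "Contact A", score: 82, rationale: "Adjacent acquirer" },
  { fullName: "Contact B", score: 49, rationale: "Weak sector overlap" },
  { fullName: "Contact C", score: 50, rationale: "Regional operator" },
  { fullName: "Contact D", score: 91, rationale: "Active rollup platform" },
];
const delivered = filterAndDeliver(scoredSample, 2); // Contact D, then Contact A
```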

Adviser-side dedup is handled by passing existing contacts to the API as a suppression list, so prior outreach and known names are filtered out before delivery. The model scores over structured records that have already come back from a verification step. It never produces a contact detail itself.
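A suppression list of this kind could be applied as in the sketch below. Matching on email and LinkedIn URL mirrors the keys the report names for the master contact graph, but whether suppression uses the same keys is an assumption, as are the normalisation rules and every name in the code.

```typescript
interface Candidate {
  fullName: string;
  email: string | null;
  linkedinUrl: string | null;
}

// Build comparable identity keys for a contact.
// Assumption: emails and URLs are compared lowercased, with trailing slashes ignored.
function identityKeys(c: Candidate): string[] {
  const keys: string[] = [];
  if (c.email) keys.push(`email:${c.email.trim().toLowerCase()}`);
  if (c.linkedinUrl) {
    keys.push(`linkedin:${c.linkedinUrl.trim().toLowerCase().replace(/\/+$/, "")}`);
  }
  return keys;
}

// Remove every candidate that shares at least one identity key with the suppression list.
function applySuppression(candidates: Candidate[], suppressed: Candidate[]): Candidate[] {
  const blocked = new Set<string>();
  for (const s of suppressed) {
    for (const key of identityKeys(s)) blocked.add(key);
  }
  return candidates.filter((c) => !identityKeys(c).some((key) => blocked.has(key)));
}

// Usage: the adviser already knows Pat, so only the new contact survives.
const knownContacts: Candidate[] = [
  { fullName: "Pat Known", email: "Pat@Example.com", linkedinUrl: null },
];
const sourcedContacts: Candidate[] = [
  { fullName: "Pat Known", email: "pat@example.com", linkedinUrl: null },
  { fullName: "New Person", email: "new@example.com", linkedinUrl: "https://linkedin.com/in/new/" },
];
const remaining = applySuppression(sourcedContacts, knownContacts);
```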

Under the hood

The LLM proposes, the schema disposes

Every contact write goes through one of three deterministic paths (sourcing-API ingest, CSV import, manual entry), all sharing the same validators and dedup helpers. Emails are lowercased and regex-checked; phones are normalised to digits-only for dedup; malformed values are dropped. The agent tool catalogue has no function whose job is to generate an email or phone, and a schema gates every tool call. Same input, same output, every time. Full contract in Annexe C.

Data Sovereignty

Procurement reviewers ask, correctly, exactly what data leaves the adviser's control and which third parties touch it. The complete answer is below; further detail in Annexes A and B.

What the adviser submits to the sourcing API

Required: buyer type, geography list, natural-language brief on the ideal investor or buyer (20+ chars), and a mandate-context object describing the company being transacted. Optional: structured filters (sector, stage, transaction type), title filters, organisation criteria, suppression lists for known organisations or contacts, seed contacts, advisor-context free text, and a delivery cap (default 30, max 150).
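The required fields and the cap rules could be expressed as a request type with a validator, as sketched below. Field names are hypothetical (the report does not publish the API schema), and rejecting an out-of-range cap rather than clamping it is an assumption.

```typescript
// Hypothetical shape of the required part of a sourcing request.
interface SourcingRequest {
  buyerType: string;
  geographies: string[];
  brief: string;                            // natural-language brief, 20+ characters
  mandateContext: Record<string, unknown>;  // describes the company being transacted
  deliveryCap?: number;                     // optional; default 30, max 150
}

const MIN_BRIEF_CHARS = 20;
const DEFAULT_CAP = 30;
const MAX_CAP = 150;

// Collect validation errors and resolve the effective delivery cap.
function validateRequest(req: SourcingRequest): { errors: string[]; deliveryCap: number } {
  const errors: string[] = [];
  if (req.buyerType.trim() === "") errors.push("buyerType is required");
  if (req.geographies.length === 0) errors.push("at least one geography is required");
  if (req.brief.trim().length < MIN_BRIEF_CHARS) {
    errors.push(`brief must be at least ${MIN_BRIEF_CHARS} characters`);
  }
  const deliveryCap = req.deliveryCap ?? DEFAULT_CAP;
  if (!Number.isInteger(deliveryCap) || deliveryCap < 1 || deliveryCap > MAX_CAP) {
    errors.push(`deliveryCap must be an integer between 1 and ${MAX_CAP}`);
  }
  return { errors, deliveryCap };
}

// Usage: a valid request with no explicit cap resolves to the default of 30.
const exampleRequest: SourcingRequest = {
  buyerType: "strategic",
  geographies: ["AU", "NZ"],
  brief: "Operators acquiring outsourced services businesses",
  mandateContext: { sector: "services" },
};
const validation = validateRequest(exampleRequest);
```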

Anonymisation option

Iceberg can anonymise the payload, passing a generic company name and a sanitised company description, and still run a successful deep search. Deal characteristics (sector, geography, transaction type, target profile) are what drive sourcing.

What touches the model provider

Mandate brief and per-contact structured payload are sent to an enterprise LLM provider for classification, query refinement, and contact scoring. The model operates under enterprise data-processing terms: customer inputs are not used to train the model (contractual), region-pinned to Sydney, TLS in transit, AES-256 at rest. The model sees the structured payload only, never the verified email address or phone number.
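One way to guarantee that contact details never reach the model is to build the model payload from an explicit allow-list of fields. The sketch below illustrates that idea only; the field names are hypothetical and the actual payload is not described in this report.

```typescript
// A contact after enrichment, including verified contact details.
interface EnrichedContact {
  fullName: string;
  title: string;
  organisation: string;
  email: string | null;
  phone: string | null;
}

// The payload sent for scoring: the same record without email or phone.
type ModelPayload = Omit<EnrichedContact, "email" | "phone">;

// Copy only allow-listed fields, so a newly added sensitive field cannot leak by default.
function toModelPayload(contact: EnrichedContact): ModelPayload {
  return {
    fullName: contact.fullName,
    title: contact.title,
    organisation: contact.organisation,
  };
}

// Usage with a made-up contact.
const enriched: EnrichedContact = {
  fullName: "Example Person",
  title: "Managing Partner",
  organisation: "Example Capital",
  email: "person@example.com",
  phone: "+61 2 5550 0100",
};
const scoringPayload = toModelPayload(enriched);
```

An allow-list is chosen over deleting known sensitive keys because it fails closed: a field that nobody remembered to strip is simply never copied.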

Hosting

Main platform (adviser-facing CRM, contact store, AI agent layer, all customer data) sits in AWS ap-southeast-2 (Sydney). Inference runs through an enterprise LLM in the same Sydney region. Some legacy components of the sourcing pipeline currently sit on overseas infrastructure; full migration to onshore residency is in progress and scheduled ahead of any procurement-driven rollout. Detail in Annexe B.

Annexes

Technical detail for engineering, security, and procurement reviewers.

A · Platform Architecture
B · Security Posture
C · Deterministic Contact Ingest

Annexe A · Platform Architecture

Audience: engineering lead, infosec architect, or CTO doing diligence. Goal: enough detail to trust the stack and run a trial. Deliberately silent on the matching, ranking, and agent-orchestration logic that makes the product differentiated; available under NDA.

A.1 System context and trust boundaries

Browser to API. Single trust boundary; all data flows through one authenticated, workspace-scoped API.
API to third parties. No client-side calls to LLM or sourcing providers. The API is the choke point, so we can log, redact, and rate-limit centrally. LLM inference runs through an enterprise endpoint pinned to australia-southeast1 (Sydney); inference tier and data tier sit in the same metropolitan region.

A.2 Stack at a glance

Layer	Tech
Frontend	Next.js 14 (App Router), NextAuth, React Query, Tailwind
API	NestJS 10, TypeScript, class-validator. 26 feature modules. JWT global guard, role and workspace decorators, OpenAPI exposed at /api.
Persistence	PostgreSQL 16 via TypeORM. Single primary, 28 entities, ~120 versioned migrations. synchronize: false; schema only changes via migrations.
Cache / queues	Redis 7 + BullMQ async workers (email sync, sequence sends, follow-ups, embeddings, sourcing polling).
Object storage	AWS S3. Access brokered by the API.
AI orchestration	Agent runtime hosted inside the API process. Tools call the same workspace-scoped repositories the REST API uses; no shadow data path. Traces export to a managed observability platform.
Sourcing client	External deep-search API client. Results land in dedicated tables and are promoted into workspace contacts under adviser review.
Email	Resend (transactional), Gmail / Outlook OAuth (user-scoped). Outbound and inbound both go through the API.

A.3 AWS topology (ap-southeast-2 Sydney)

All adviser-facing platform data in Sydney. RDS, Redis, S3, ECS all in ap-southeast-2. Some legacy sourcing components currently offshore; full migration to onshore in progress (see B.1).
TLS terminates at the ALB. Internal traffic to the task on the AWS private network only; task SG accepts no other source.
Data tier is private. RDS and Redis security groups accept ingress only from the ECS task SG.
Secrets from SSM Parameter Store as SecureString (KMS). Not baked into images, not in logs.
Infrastructure as code (Pulumi). Staging and production are separate stacks with separate VPCs, databases, Redis, ECR, and DNS records.

A.4 Authentication and tenancy

JwtAuthGuard registered as a global guard. Every route authenticated unless explicitly @Public().
JWT payload carries user id, current workspace id, and workspace roles. Live role re-loaded on every sensitive check rather than trusting stale token claims.
Multi-tenancy enforced in code: every business entity has a non-nullable workspaceId. Service layer filters on workspace.id on every read and write.
Role decorators (@RequireRole, @RequireCrmAccess, @RequireAdvisorAccess, @RequireSuperAdmin) layer on top of the auth guard.
ThrottlerModule globally registered. CORS allow-list driven from environment config. Body limit 50 MB.
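The workspace-scoping rule can be illustrated with a small in-memory store whose every read requires a workspace id. This is a sketch of the pattern only: the class and field names are hypothetical, and the in-memory array stands in for the TypeORM repositories the platform uses.

```typescript
// Every business entity carries a non-nullable workspaceId.
interface DealRecord {
  id: string;
  workspaceId: string;
  title: string;
}

// Minimal stand-in for a service layer that filters on workspace for every read and write.
class WorkspaceScopedDeals {
  private readonly rows: DealRecord[] = [];

  // Writes always stamp the caller's workspace onto the row.
  insert(workspaceId: string, id: string, title: string): DealRecord {
    const row: DealRecord = { id, workspaceId, title };
    this.rows.push(row);
    return row;
  }

  // Reads cannot be issued without a workspace id, so there is no unscoped query.
  findAll(workspaceId: string): DealRecord[] {
    return this.rows.filter((r) => r.workspaceId === workspaceId);
  }

  // Looking up another workspace's row by id returns undefined rather than the row.
  findOne(workspaceId: string, id: string): DealRecord | undefined {
    return this.rows.find((r) => r.workspaceId === workspaceId && r.id === id);
  }
}

// Usage: two workspaces, each seeing only its own data.
const deals = new WorkspaceScopedDeals();
deals.insert("ws-a", "deal-1", "Trade sale");
deals.insert("ws-b", "deal-2", "Capital raise");
const visibleToA = deals.findAll("ws-a");
const crossTenantLookup = deals.findOne("ws-a", "deal-2"); // undefined
```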

A.5 Where AI sits in the stack

The AI orchestration layer is hosted inside the NestJS process. One process to log, throttle, redact, and observe. AI tools call the same workspace-scoped repositories the REST API uses, so there is no shadow data path. Conversation memory persists in a dedicated schema in the same DB with the same backup, encryption, and region. Model traces (model, tool calls, latency, token counts) export to managed observability.

Model calls go through an enterprise LLM endpoint pinned to australia-southeast1 (Sydney). The choice of enterprise endpoint (rather than the consumer API) is specifically for the data-handling contract: prompt and response data not used to train foundation models, region-pinned, TLS in transit, enterprise DPA covers data processing. The request leaves the AWS VPC for a managed inference endpoint but never leaves the Sydney metro region.

Agent tools split into retrieval tools (read from our DB, the sourcing API, or live web search; returns include citations) and mutation tools (write through the same NestJS services as the REST API, so all the same validation, normalisation, and dedup apply).

A.6 Deploy and release

Single CI pipeline: build container, push to ECR with commit SHA, ECS rolling deploy. Immutable SHA tags.
Migrations run on container boot before the HTTP server starts. A bad migration fails the rollout; the previous task keeps serving.
Staging and production fully isolated.

A.7 Out of scope for this document

The contact-to-buyer matching model, the agent orchestration logic, and the scoring weights are not described here. Available under NDA.

Annexe B · Security Posture

Audience: procurement or InfoSec reviewer working through a checklist. Each section names the control and the implementation today; code references and IaC walkthroughs available on request.

B.1 At a glance

Topic	Status
Data sovereignty	Customer data tier in AWS ap-southeast-2 (Sydney). LLM inference pinned to Sydney. Some legacy sourcing components currently offshore; full migration to onshore in progress.
Hosting	AWS (ECS Fargate + RDS + ElastiCache + S3). IaC via Pulumi.
Authentication	NestJS JWT global guard, bcrypt password hashing, NextAuth on the frontend.
Multi-tenancy	Workspace-scoped data model enforced in code on every query.
Authorisation	Role decorators on every controller.
TLS	TLS 1.2+ at the ALB, ACM certificates, HTTP-to-HTTPS 301.
Encryption at rest	RDS, ElastiCache, S3 all encrypted with AWS-managed KMS keys.
Network isolation	App + data tier in private subnets. Ingress only via ALB.
Secrets	SSM Parameter Store SecureString (KMS). No secrets in source control.
Logging	ECS to CloudWatch; exception capture to PostHog.
Backups	Automated RDS snapshots. Redis snapshotting on production.
Source control	Private GitHub repo. CI deploys via GitHub Actions.

B.2 Authentication and authorisation

Passwords: bcrypt (cost 6 on signup, 12 on stored refresh tokens). @Exclude() on the password field so it never serialises out of the API.
Access tokens: signed JWT, 7-day expiry, validated on every request. Refresh tokens signed and bcrypt-hashed at rest. Signup codes: 24h expiry, constant-time comparison.
Role model: ADMIN, MEMBER, USER, CHAT_ADMIN, CHAT_USER, ADVISOR_ADMIN, ADVISOR_MEMBER, gated by decorators.
Workspace context decorator validates that the user actually has a role in the requested workspace before injecting it.
Super-admin: boolean on User, gated by decorator, used for Iceberg-side concierge support. Logged.

Result: no controller can be reached by an anonymous user unless explicitly @Public(), and no workspace data by a user not a member of that workspace.

B.3 Transport, encryption, network

TLS terminates at AWS ALB with ACM certificates, TLS 1.2 minimum. HTTP-to-HTTPS 301. CORS allow-list from environment config. Global request validation with strict mode (whitelist, forbidNonWhitelisted, transform).
Encryption at rest: RDS, ElastiCache (with in-transit encryption and auth token), and S3 all encrypted with AWS-managed KMS. Secrets in SSM SecureString. CloudWatch logs KMS-encrypted.
Network: dedicated production VPC. Public subnets carry the ALB only; ECS tasks sit in private subnets. RDS reachable only from the ECS task SG on 5432, Redis on 6379. Single NAT Gateway egress. Private ECR; image pull via task execution role.

B.4 Secrets and supply chain

Runtime secrets in SSM Parameter Store SecureString, KMS-encrypted. Task definitions reference SSM parameters at boot; values not embedded in image layers. .env files gitignored.
Pulumi state in S3 with passphrase-encrypted config secrets. GitHub Actions secrets per environment. OIDC federation on the roadmap to replace static IAM keys.
Dependency hygiene: package.json overrides pin transitive deps with known fixes. ECR scan on push. npm audit on developer machines; CI integration of npm audit + Dependabot on the roadmap. Pinned Node LTS base image. RDS and Redis auto-minor-version upgrade enabled.

B.5 Logging, monitoring, audit

Application logs: NestJS Logger to stdout to CloudWatch (per-environment log group).
Exception capture: PostHog globally registered; uncaught and unhandled rejections captured at process level.
Auth log lines record sub, email, iat, exp. No token, no PII payload. CORS rejections logged.
AI observability: model, tool calls, latency, token counts exported to a managed observability platform.
Metrics: CloudWatch default metrics, ALB target health. CloudWatch alarms and on-call paging on the roadmap.
A structured audit table (actor, action, entity, before, after, at) is planned for procurement-driven customers.

B.6 Backups and disaster recovery

RDS automated daily snapshots; point-in-time recovery within retention. Single-AZ today on the standard SKU; multi-AZ enablement on the roadmap.
Redis snapshot retention: 7 days production, 1 day staging.
S3 versioning: off today, planned for production document buckets.
Database restore drill documented; full DR exercise planned ahead of procurement-driven rollout.

B.7 Sub-processors

These third parties receive request payloads or store user data on Iceberg's behalf. All are reachable from the API only, behind authenticated tools. No client-side calls.

Vendor	Data flow	Purpose
AWS (ap-southeast-2)	All customer data	Primary hosting (compute, DB, cache, storage, secrets)
Cloudflare	DNS + edge TLS	DNS and CDN
Enterprise LLM (Sydney region)	Prompt text including excerpts of adviser-supplied content; responses returned to API	Inference. Enterprise terms: prompt and response data not used to train foundation models. Region-pinned. TLS in transit, AES-256 at rest. Enterprise DPA covers data processing.
Sourcing provider (deep-search API)	Mandate parameters; receives back investor records	Investor sourcing pipeline
Web search provider	Search queries derived from adviser intent	Real-time web search citations
Resend	Outbound email content + recipient address	Transactional + system email delivery
Gmail / Outlook (per-user OAuth)	User's own inbox under user-granted scopes	Optional inbox sync for the adviser
Observability platform	LLM request/response traces, token counts	Observability of AI calls
PostHog	Product analytics + exception payloads	Analytics + error monitoring

Formal sub-processor disclosure with vendor security pages and DPAs available on request.

B.8 OWASP coverage

Injection: parameterised queries via TypeORM, class-validator on every DTO, global ValidationPipe in strict mode.
Broken auth: global JWT guard, refresh token bcrypt-hashed, constant-time comparison for signup codes.
Sensitive data exposure: @Exclude() on password fields, secrets in SSM, no PII in logs by convention.
Broken access control: workspace and role decorators on every controller, verified by tests.
XSS: React auto-escaping. CSP review on roadmap.
Deserialisation: all inputs through class-validator DTOs.

B.9 Access to production and incident response

AWS console: Iceberg engineering only, IAM users with MFA required.
Production DB access: ECS task SG + allow-listed admin IPs only.
Deployment: GitHub Actions only with scoped AWS credentials.
Detection: error log alarms (in flight), exception dashboards, observability error-rate signals.
Response: engineering acknowledges incidents during AU business hours, on-call rota outside hours.
Customer notification: workspace owners notified directly for any incident touching their data.

B.10 Known gaps and roadmap

Surfaced transparently because procurement will ask:

Reduce access-token lifetime from 7 days, add refresh token rotation
Migrate CI from static IAM keys to GitHub OIDC federation
Multi-AZ RDS, S3 versioning, lengthened RDS backup retention
Structured audit trail table on sensitive mutations
CloudWatch alarms + on-call paging wired into the standard incident channel
Full migration of remaining offshore components to onshore residency
SOC 2 readiness assessment (Type I target September 2026)

Annexe C · Deterministic Contact Ingest

The question that comes up at every meeting: "How do you stop the model making up emails and phone numbers?"

Short answer: the model is not asked to. Contact data flows through a deterministic ingest pipeline (external sourcing API, CSV import, or user-entered fields) and lands in the database via the same validation and dedup helpers in every case. The LLM can decide which contact to act on, but it cannot author an email address or phone number into the system.

C.1 The three (and only three) paths

Path	Source of truth	Determinism mechanism
Sourcing run	External deep-search API (mandate-driven)	API returns structured rows. Upsert into GlobalContact, then promoted into the workspace as Contact.
CSV / spreadsheet import	The adviser's file	Parsed row by row. Values run through normalisers before insert.
Manual entry	Adviser types into the CRM form	Same DTOs, same validators, same normalisers as import.

There is no fourth path. The AI does not invent contact records. The agents call the same workspace-scoped services the REST API uses, with the same schemas and the same guards.

C.2 The validation boundary

Every write of an email or phone goes through the same two helpers (email and phone), with shared array dedup and column sync applied on top:

Email. Lowercased, regex-checked. Malformed values are dropped, not stored.
Phone. Display format preserved, but deduplicated on the digits-only form so +1-415-555-0100 and 14155550100 collapse to one record.
Array dedup. Set-based on the normalised form. First occurrence wins.
Legacy and array sync. Single-string legacy columns and array columns are always rebuilt from each other so they cannot drift.
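The email and phone rules above can be sketched as two small helpers plus a set-based dedup. This is a minimal illustration, not the production code: the regex is a deliberately simple stand-in, trimming whitespace is an assumption, and all function names are hypothetical.

```typescript
// Simplified email pattern for illustration; the real check is not published.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Lowercase and regex-check an email. Malformed values become null and are dropped by the caller.
function normaliseEmail(raw: string): string | null {
  const value = raw.trim().toLowerCase();
  return EMAIL_PATTERN.test(value) ? value : null;
}

// Dedup key for a phone: digits only, so formatting differences collapse.
function phoneDedupKey(raw: string): string {
  return raw.replace(/\D/g, "");
}

// Set-based dedup on the normalised form. Display format is preserved and the first occurrence wins.
// Assumption: values with no digits at all are treated as malformed and dropped.
function dedupPhones(phones: string[]): string[] {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const phone of phones) {
    const key = phoneDedupKey(phone);
    if (key === "" || seen.has(key)) continue;
    seen.add(key);
    kept.push(phone);
  }
  return kept;
}

// Usage, including the two phone formats quoted above.
const cleanedEmail = normaliseEmail("  Jane.Doe@Example.COM ");          // "jane.doe@example.com"
const rejectedEmail = normaliseEmail("not-an-email");                    // null
const uniquePhones = dedupPhones(["+1-415-555-0100", "14155550100"]);    // ["+1-415-555-0100"]
```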

In parallel, ~25 enum-like normalisers (geography, investor type, check size, priority, etc.) are contract-bound to return null on no-match rather than throw. A bad row never poisons a batch import; the caller decides whether to default or reject.
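The null-on-no-match contract might look like the sketch below for one such normaliser. The enum values, the alias table, and the function names are all hypothetical; only the contract (return null, never throw, let the caller decide) comes from the text above.

```typescript
// Hypothetical canonical values for an investor-type field.
type InvestorType = "PE" | "VC" | "FAMILY_OFFICE" | "STRATEGIC";

// Hypothetical alias table mapping free-text inputs to canonical values.
const INVESTOR_TYPE_ALIASES = new Map<string, InvestorType>([
  ["pe", "PE"],
  ["private equity", "PE"],
  ["vc", "VC"],
  ["venture capital", "VC"],
  ["family office", "FAMILY_OFFICE"],
  ["strategic", "STRATEGIC"],
]);

// Contract: return the canonical value, or null when nothing matches. Never throw.
function normaliseInvestorType(raw: string): InvestorType | null {
  return INVESTOR_TYPE_ALIASES.get(raw.trim().toLowerCase()) ?? null;
}

// The caller decides what a null means. Here a batch import keeps good rows
// and reports the row numbers it could not classify.
function importInvestorTypes(rows: string[]): { accepted: InvestorType[]; rejectedRows: number[] } {
  const accepted: InvestorType[] = [];
  const rejectedRows: number[] = [];
  rows.forEach((raw, index) => {
    const value = normaliseInvestorType(raw);
    if (value === null) rejectedRows.push(index + 1);
    else accepted.push(value);
  });
  return { accepted, rejectedRows };
}

// Usage: the unrecognised second row does not stop the batch.
const batchResult = importInvestorTypes(["Private Equity", "???", "vc"]);
```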

C.3 Provenance

Each Contact records how it got there: source (USER, SOURCED, or GLOBAL), sourceGlobalContactId back-reference, lastGlobalSyncAt timestamp, and per-deal aiRanking + aiReasoning. Signals (news mentions, events) carry their source URL. The canonical GlobalContact stores the raw upstream payload so we can re-parse later without going back to the provider. A reviewer can answer 'where did this email come from?' by inspecting the chain. No audit dead-ends.
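The provenance chain can be sketched as a lookup from a workspace contact back to its upstream record. The source, sourceGlobalContactId, and lastGlobalSyncAt fields follow the description above; the provider field, the reading of USER as adviser-supplied, and the function name are assumptions made for illustration.

```typescript
type ContactSource = "USER" | "SOURCED" | "GLOBAL";

// Canonical upstream record. `provider` is a hypothetical field for this sketch.
interface GlobalContactRecord {
  id: string;
  provider: string;
  rawPayload: unknown;  // raw upstream payload, kept for later re-parsing
}

// Workspace-level contact with its provenance fields.
interface ContactRecord {
  id: string;
  email: string;
  source: ContactSource;
  sourceGlobalContactId: string | null;
  lastGlobalSyncAt: Date | null;
}

// Answer "where did this email come from?" by walking the chain.
// Assumption: USER means the adviser typed or imported the value.
function traceOrigin(contact: ContactRecord, globals: Map<string, GlobalContactRecord>): string {
  if (contact.source === "USER") {
    return `${contact.email}: supplied by the adviser`;
  }
  const globalId = contact.sourceGlobalContactId;
  const upstream = globalId === null ? undefined : globals.get(globalId);
  if (upstream === undefined) {
    // A sourced contact without an upstream record would be an audit dead-end.
    throw new Error(`contact ${contact.id} has no resolvable upstream record`);
  }
  return `${contact.email}: ${contact.source} via ${upstream.provider} (global contact ${upstream.id})`;
}

// Usage with made-up records.
const globalRecords = new Map<string, GlobalContactRecord>([
  ["g-1", { id: "g-1", provider: "example-provider", rawPayload: { row: 1 } }],
]);
const sourcedRecord: ContactRecord = {
  id: "c-1",
  email: "partner@example.com",
  source: "SOURCED",
  sourceGlobalContactId: "g-1",
  lastGlobalSyncAt: new Date("2025-01-01T00:00:00Z"),
};
const originNote = traceOrigin(sourcedRecord, globalRecords);
```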

C.4 What the LLM can and cannot do

Agents are wired to a fixed catalogue of tools. Each tool has a schema-validated input. The model can choose a tool and propose arguments; the schema validates before the tool body runs, and the tool body calls back into the same NestJS service the REST API uses.

Tool family	Can the model invent contact PII?
Retrieval (research, sourcing, web search)	No. Pulls from external providers or our DB. Returns include citations and source URLs. Extraction temperature 0.2 (near-deterministic).
Mutation (create, update)	The model can propose values, but they flow through the same DTO + validator + dual-write path as a manual form submission. Malformed values dropped, duplicates collapsed.
PII generation	No tool with this purpose exists in the catalogue.

Agent prompts also enforce the contract: research first, present, confirm, then write. A row without confirmed provenance does not get written.
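The fixed-catalogue and schema-gate idea can be sketched as a dispatcher that rejects unknown tools and validates arguments before any tool body runs. Everything named here (the tool, its arguments, the helper functions) is hypothetical; the sketch uses a hand-written validator in place of whatever schema library the platform uses.

```typescript
// A registered tool takes raw, untrusted arguments and returns a result string.
type ToolHandler = (rawArgs: unknown) => string;

// Wrap a tool so that validation always runs before the tool body.
function defineTool<T>(validate: (raw: unknown) => T, run: (args: T) => string): ToolHandler {
  return (rawArgs: unknown) => run(validate(rawArgs));
}

interface SearchArgs {
  query: string;
}

// Strict validator: exactly one key, `query`, holding a non-empty string.
function validateSearchArgs(raw: unknown): SearchArgs {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("arguments must be an object");
  }
  const record = raw as Record<string, unknown>;
  const keys = Object.keys(record);
  if (keys.length !== 1 || keys[0] !== "query") {
    throw new Error("unexpected or missing fields");
  }
  const query = record["query"];
  if (typeof query !== "string" || query.trim() === "") {
    throw new Error("query must be a non-empty string");
  }
  return { query };
}

// The fixed catalogue. There is deliberately no tool that generates an email or phone.
const catalogue = new Map<string, ToolHandler>([
  ["searchContacts", defineTool(validateSearchArgs, (args) => `searching for: ${args.query}`)],
]);

// The model may name a tool and propose arguments; anything outside the catalogue fails.
function dispatch(toolName: string, rawArgs: unknown): string {
  const handler = catalogue.get(toolName);
  if (handler === undefined) {
    throw new Error(`no such tool in catalogue: ${toolName}`);
  }
  return handler(rawArgs);
}

// Usage: a well-formed call succeeds.
const searchResult = dispatch("searchContacts", { query: "services rollup" });
```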

C.5 Schema-bound outputs and idempotency

Structured model outputs (pitch-deck summaries, contact insights) use a schema as the response format. The output shape is fixed; the model cannot add unknown fields, and if it fails the schema the call fails loudly rather than silently writing a garbage row.

Running the same operation twice does not duplicate rows. Race-safe upserts on global organisations and people (unique normalised-name index, compound lookups on LinkedIn URL, email, name). Bring-in is idempotent on (workspaceId, globalContactId). CSV import surfaces specific row conflicts rather than silently double-writing.
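Idempotency on (workspaceId, globalContactId) can be illustrated with an in-memory store keyed on that pair. This is a sketch only: the names are hypothetical, and in the real system the guarantee would come from database constraints and upserts rather than a Map.

```typescript
interface BroughtInContact {
  workspaceId: string;
  globalContactId: string;
}

// In-memory stand-in for a table with a unique key on (workspaceId, globalContactId).
class BringInStore {
  private readonly rows = new Map<string, BroughtInContact>();

  // Replaying the same bring-in returns the existing row instead of creating a duplicate.
  bringIn(workspaceId: string, globalContactId: string): { contact: BroughtInContact; created: boolean } {
    // JSON encoding of the pair avoids collisions between ids that contain separators.
    const key = JSON.stringify([workspaceId, globalContactId]);
    const existing = this.rows.get(key);
    if (existing !== undefined) {
      return { contact: existing, created: false };
    }
    const contact: BroughtInContact = { workspaceId, globalContactId };
    this.rows.set(key, contact);
    return { contact, created: true };
  }

  count(): number {
    return this.rows.size;
  }
}

// Usage: the second identical call is a no-op, while another workspace gets its own row.
const bringInStore = new BringInStore();
const firstCall = bringInStore.bringIn("ws-a", "g-1");
const replayedCall = bringInStore.bringIn("ws-a", "g-1");
const otherWorkspace = bringInStore.bringIn("ws-b", "g-1");
```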

C.6 What determinism does not mean

Generated text is still probabilistic. A drafted outreach email is not byte-identical between runs. The contact identifier is deterministic; the content the adviser sends is a draft for human review.
Upstream coverage is not deterministic. If the sourcing provider has not yet indexed an obscure target, no part of the stack can manufacture that record. The model surfaces low-confidence results with citations rather than inventing.
Live web search is non-stationary. We capture the citation at the time of writing; the web underneath can change.

What we guarantee, plainly:

Every email or phone in the system can be traced to a non-AI origin.
The same ingest payload produces the same contact record, every time, and does not duplicate on replay.
The model cannot author a new email or phone into the database. It can only propose one through a validator that will drop malformed inputs.
Every contact knows its source. The adviser can challenge any record.