Realistic OCR accuracy expectations and the extraction prompt patterns that hold up at 50+ documents. When to step up to a dedicated OCR pipeline.
TL;DR. Cowork does OCR and structured extraction well enough for mid-market document workloads, with one honest caveat: form fields hit 95% accuracy, free-form line items hit 60–70%. Plan a human review step for the first 90 days. The prompt patterns below are what survive at scale (50+ documents per run, not just one).
If you can read it on a normal monitor without squinting, Cowork can probably extract it. If you have to zoom in twice to read it yourself, accuracy drops fast.
| Input | Typical accuracy |
|---|---|
| Form fields (named boxes) | 95–97% |
| Document headers and totals | ~95% |
| Free-form invoice line items | 60–70% — always review |
| Handwriting | Poor — last resort |
The gap between *extract* and *extract with confidence* is what determines whether a finance pipeline is genuinely automatable. A 95% number on totals is great; a 65% number on line items means the operator still has to scan every output.
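To make that gap concrete: per-field accuracy compounds across a document. A quick sketch, assuming errors are independent across fields (a simplification, and the 10-line invoice is a hypothetical example):

```python
def p_at_least_one_error(per_item_accuracy: float, items: int) -> float:
    """Probability that a document contains at least one extraction error,
    assuming errors are independent across extracted items."""
    return 1 - per_item_accuracy ** items

# A single total at 95% accuracy: 5% of documents need a correction.
print(round(p_at_least_one_error(0.95, 1), 3))   # 0.05

# Line items at 65%: a hypothetical 10-line invoice is almost never clean.
print(round(p_at_least_one_error(0.65, 10), 3))  # 0.987
```

This is why a 65% line-item number forces a review step even when the headline accuracy looks acceptable.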
Four rules that make extraction outputs usable:
- Declare a type for every field you extract.
- Require a confidence rating on every row (one confidence column saves a full manual review pass).
- Include the source filename so every row is auditable.
- Plan first: see the field mapping before processing starts.

Template:
For each PDF in ~/inbox/receipts:
- Extract: vendor (string), date (YYYY-MM-DD), total (number), currency (string),
category (string from CLAUDE.md list), confidence (low | medium | high).
- One row per receipt.
- Output to ~/output/receipts.csv with column headers in row 1.
- Include source_filename as the first column.
- Flag any row where confidence is low — separate sheet "Review queue".
- Plan first; show me the field mapping before processing.
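The low-confidence flag in the template maps to a simple post-processing step: split the output into a clean sheet and a review queue. A minimal sketch, assuming the column names from the template above (the sample rows are made up):

```python
def split_review_queue(rows):
    """Separate low-confidence rows into a review queue."""
    clean = [r for r in rows if r["confidence"] != "low"]
    review = [r for r in rows if r["confidence"] == "low"]
    return clean, review

# Hypothetical extracted rows, shaped like the template's output CSV.
rows = [
    {"source_filename": "r1.pdf", "vendor": "Acme", "total": "12.50", "confidence": "high"},
    {"source_filename": "r2.pdf", "vendor": "???", "total": "9.99", "confidence": "low"},
]
clean, review = split_review_queue(rows)
print(len(clean), len(review))  # 1 1
```

Writing the two lists to separate sheets (or separate CSVs) is then a straightforward `csv.DictWriter` step.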
The plan-first habit matters even more for extraction than for document generation, because errors are silent — a wrong number does not look wrong until someone reconciles it.
For 20 or more files, also ask for an extraction-manifest.csv — so you can audit which file produced which row. A 200-file extraction run produces a 200-row CSV plus a manifest plus a review queue. If your review queue is bigger than 30 rows, the prompt needs tightening before scaling further.
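The manifest enables a basic audit: every input file should appear in the output exactly once. A sketch, assuming the manifest lists one source filename per processed file (filenames illustrative):

```python
from collections import Counter

def audit(manifest_files, output_files):
    """Flag files that produced no output row (dropped) and files that
    produced more than one (duplicated)."""
    counts = Counter(output_files)
    missing = [f for f in manifest_files if counts[f] == 0]
    duplicated = [f for f in manifest_files if counts[f] > 1]
    return missing, duplicated

missing, dup = audit(["a.pdf", "b.pdf", "c.pdf"], ["a.pdf", "c.pdf", "c.pdf"])
print(missing, dup)  # ['b.pdf'] ['c.pdf']
```

Silently dropped files are the failure mode this catches: a 200-file run that produces a 198-row CSV looks fine until reconciliation.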
Cowork is not Tesseract, AWS Textract, or Google Document AI. If you are processing more than roughly 1,000 documents a day with strict accuracy SLAs, you want a dedicated OCR pipeline, with Cowork on top for the post-OCR work — summarisation, classification, exception handling, narrative drafting.
The Tinkso pattern in those cases: the dedicated pipeline handles the raw OCR; Cowork handles everything downstream.
Below 1,000 documents a day, Cowork alone is usually enough.
The 95% / 65% accuracy gap matters more than buyers expect on the first call. We tell clients to budget a human review step for the first 90 days of any extraction workflow. After 90 days, the prompt and CLAUDE.md are tuned enough that review shrinks to spot-checks — usually one in twenty.
The teams that get this wrong typically over-trust the totals (which are accurate) and under-review the line items (which are not). The fix is not better OCR; it is a better review queue.
Take 10 receipts. Run the extraction prompt template above. Hand-check every line item. The accuracy you measure on those 10 is your team's actual baseline — not the marketing number, not the vendor demo, not the average across the internet. Use that number to decide where the human-review threshold sits.
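Measuring that baseline is just a field-by-field comparison against your hand-checked values. A sketch, assuming you record each extracted value alongside its ground truth (the data shown is made up):

```python
def field_accuracy(pairs):
    """pairs: list of (extracted, ground_truth) tuples for one field type,
    collected across all hand-checked receipts."""
    correct = sum(1 for extracted, truth in pairs if extracted == truth)
    return correct / len(pairs)

# Hypothetical hand-check across 10 receipts: one OCR slip on a total.
totals = [("12.50", "12.50")] * 9 + [("12.5O", "12.50")]
print(field_accuracy(totals))  # 0.9
```

Compute this separately per field (totals, dates, line items): the per-field numbers, not the blended average, tell you where the human-review threshold belongs.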
Book a 30-minute call. We'll ask where you are, what your team needs, and which systems Cowork should touch.