About Captur
Captur helps software understand real-world scenes in real time with an SDK for flexible, on-demand visual recognition. We're a small, rapidly scaling team backed by top-tier investors; we recently closed a $6M seed round to accelerate product and go-to-market growth. We are global leaders in edge ML and have validated over 150M images on-device for enterprise customers such as Lime. Next, we're expanding into a horizontal platform across use cases that require real-time speed, high volume, and coverage across a wide range of mobile devices.
About the role
The role focuses on the camera flow inside our clients' mobile apps. We don't do single-photo verification - our models run over the live camera stream, every frame, in real time, on the user's phone. Whatever the user sees and feels while framing the shot is the problem you'd be working on.
The interesting problem: model output is probabilistic and noisy, the scene is different every time, and the user is operating in the real world - on foot, in the rain, gloves on, one hand free. Visual UI alone usually isn't enough. Haptics, symbolic cues (think game HUD), and visual feedback all need to play together, and at any given moment you have to decide what to surface, on which channel, in what order.
What you'll do
- Real-time feedback design. A scooter rider photographs a finished trip in the rain. The camera streams at ~30fps; the on-device model gives a confidence-shaped output per frame. The job is to turn that stream into something the rider can act on inside their visual loop - usually under 300ms.
- Parallel feedback channels. Visual, haptic, symbolic. Each has different bandwidth, different attentional cost, different latency. Mapping the right signal to the right channel - and prioritising across them when several things are wrong at once - is the headline challenge.
- Generalising across scenes. A courier at a doorstep, a rider at trip-end - same SDK, different scenes, different priorities. We want a feedback system that generalises rather than one hand-tuned per client.
- End-to-end ownership. You'd scope, prototype, ship, and measure one piece of work over the internship. A 1:1 mentor helps you scope it in week one and reviews throughout.
Technical Challenges
Real-time stream verification
- The camera runs at ~30fps. Each frame is an input. The on-device model emits probabilistic output per frame. The feedback layer aggregates that stream and decides what to surface, when (real-time systems, signal processing).
- Sub-300ms end-to-end budget. Anything you display or vibrate has to land inside the user's visual loop, not after they've moved on (HCI, perception of latency).
- Multi-frame smoothing - confidence over the last N frames, thresholds for triggering feedback, asymmetric thresholds (different for "shot is good" vs "shot is bad") (signal processing, applied statistics).
- Visual UI is one channel. Haptics and symbolic / game-HUD cues are others. Each has different bandwidth, different attentional cost, different latency to perception (multi-modal interfaces, HCI).
- Mapping signal to channel: haptics are good for "now, wrong" pulses; visual for fine-grained framing guidance. Why? When does that mapping break? (interaction design, sensory psychology adjacent).
- Priority ordering: if three things are wrong with the shot at once, which do you tell them first? Why? Does the answer change once they've corrected one? (information design, game UI / HUD design).
- Same SDK, different scenes: courier at a doorstep, rider at trip-end, driver photographing a damaged parcel. Different lighting, different priorities, different "what does a good shot look like" (HCI, generalisation).
- The model output is one signal. Time-of-day, IMU motion, ambient light, the user's prior attempts in this session are others. How do you combine them into a single coherent feedback layer (sensor fusion, applied ML).
- How do you build a system that generalises rather than hand-tuning per client? (software architecture, declarative configuration).
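To make the multi-frame smoothing challenge concrete, here is a minimal sketch of confidence aggregation with asymmetric enter/exit thresholds (hysteresis). The class name, window size, and threshold values are illustrative assumptions, not Captur's production design.

```python
from collections import deque

class FrameSmoother:
    """Aggregate noisy per-frame confidence into a stable good/bad verdict.

    Hypothetical sketch: window size and thresholds are illustrative
    defaults, not production values.
    """

    def __init__(self, window=10, enter_good=0.8, exit_good=0.4):
        # Asymmetric thresholds: sustained high confidence is needed to
        # declare the shot good, but only a moderate drop revokes it.
        self.window = deque(maxlen=window)
        self.enter_good = enter_good
        self.exit_good = exit_good
        self.is_good = False

    def update(self, confidence: float) -> bool:
        """Feed one frame's confidence; return the current verdict."""
        self.window.append(confidence)
        avg = sum(self.window) / len(self.window)
        if self.is_good:
            if avg < self.exit_good:
                self.is_good = False
        elif avg > self.enter_good:
            self.is_good = True
        return self.is_good
```

The point of the hysteresis gap is that a single noisy frame at 30fps never flips the feedback state, so the UI doesn't flicker between "good" and "bad" while the user holds roughly steady.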
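The channel-mapping and priority-ordering challenges, combined with the "declarative rather than hand-tuned" goal, might look something like the sketch below: one default rule table plus per-scene priority overrides. All issue names, channels, and priorities here are invented for illustration.

```python
# Declarative feedback rules: one table shipped with the SDK,
# per-scene overrides layered on top, instead of per-client logic.
# Issue names, channels, and priorities are illustrative assumptions.

DEFAULT_RULES = [
    # (issue, channel, priority) -- lower number is surfaced first
    ("object_not_in_frame", "haptic", 0),   # urgent "now, wrong" pulse
    ("too_far", "visual", 1),               # fine-grained framing guidance
    ("too_dark", "symbolic", 2),            # HUD-style icon
]

SCENE_OVERRIDES = {
    # Hypothetical: a courier at a doorstep cares about lighting
    # before distance, so the priorities swap.
    "doorstep_delivery": {"too_dark": 1, "too_far": 2},
}

def next_cue(active_issues, scene=None):
    """Pick the single highest-priority cue to surface this frame."""
    overrides = SCENE_OVERRIDES.get(scene, {})
    rules = [(issue, channel, overrides.get(issue, prio))
             for issue, channel, prio in DEFAULT_RULES
             if issue in active_issues]
    if not rules:
        return None
    issue, channel, _ = min(rules, key=lambda r: r[2])
    return issue, channel
```

Surfacing only the top cue per frame is one answer to "which do you tell them first"; re-running the selection each frame means the ordering naturally updates once the user corrects an issue.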
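For the signal-combination challenge, a minimal fusion sketch under stated assumptions: model confidence discounted by IMU motion (fast movement means blurred frames) and capped in very low light. The weights, gates, and units are illustrative, not a spec.

```python
def fused_confidence(model_conf: float, imu_motion: float,
                     ambient_lux: float) -> float:
    """Combine model output with device signals into one feedback score.

    Illustrative assumptions: imu_motion is a 0-5 magnitude,
    ambient_lux a rough light reading; both scales are invented.
    """
    # Heavy motion blurs frames, so discount the model's opinion.
    motion_penalty = max(0.0, 1.0 - imu_motion / 5.0)
    # In very low light, cap what the feedback layer will claim.
    light_cap = 1.0 if ambient_lux >= 10 else 0.5
    return min(model_conf * motion_penalty, light_cap)
```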
Qualifications
Required:
- Some programming experience, e.g. JavaScript (React Native, TypeScript), Swift, Kotlin, or Python.
- Some interaction design, motion design, or HCI work - coursework, a side project, or self-directed study. Send us one example in your application.
- Coursework or projects in computer vision, ML, or signal processing.
Nice to have:
- Front-end (TypeScript / React) - useful for our internal debug tools and visualisations.
- Prototyping tools (Figma, Origami, ProtoPie, or hand-rolled HTML / SwiftUI).
- Game UI design, aviation HUD design, accessibility / multi-modal interface work, or anything else where you've thought about feedback beyond visual UI.
- User research methods - think-alouds, contextual inquiry, watching your flatmate use your project.
Role Compensation and Working Details
- 8-12 week internship, starting on 22nd or 29th June depending on applicant availability.
- Base salary of £35,000 per annum, pro-rated for the duration of the internship.
- Taxable housing stipend of £125 per week for interns whose permanent address is outside London and who need to pay for accommodation during the internship.
- Based 3 days a week from our Liverpool Street Office, with work from home on the remaining days.
- 25 days' holiday plus public holidays, pro-rated for the duration of the internship.
- Dedicated company MacBook Pro for use during the internship.
- Dedicated company Apple or Android device for UX testing during the internship.
Captur London, England Office
London, United Kingdom, SW3 4TG


