Ryan Tang
BenchSci · Lead Product Designer · 2024

Designing for Scientists Who Don't Trust LLM Answers

Led design on BenchSci's GenAI pivot within the Experiment Validation stream. Built a standalone MVP and a hybrid interface pairing agentic chat with an evidence table, for scientists who needed more than an LLM chat.

AI Product · Research · Enterprise · Strategy
100K+
Downmarket users reached
8-figure
Thermo Fisher partnership
0
AI products released
3+
Product teams adopted the journey framework
Table-aware agentic chat where users can deeply engage with the table

The context

BenchSci makes software for pharmaceutical scientists, who need to validate that experiments are successful and reproducible before committing millions to a drug program. They are users of our enterprise clients, and deeply skeptical by training: the scientific method teaches you to distrust conclusions that lack sufficient evidence. They do their own analysis despite recommendations, and aren't as trusting as casual users. The rise of LLMs presented new opportunities to empower their research, but also challenges around trust and evidence.

Before the pivot, BenchSci had two segments: a mid-market search product and an enterprise platform bundling multiple point solutions across the preclinical workflow. Each had its own search logic, interface patterns, and data model.

Influencing AI strategy before our pivot

Before the company's GenAI pivot, I'd been exploring conversational interactions as a potential solution to a platform-wide search problem. When ChatGPT arrived and the company started asking what LLMs could mean for the business, I quickly built a prototype and partnered with my Director of Product Design. We showed it to the company at a retro, and it directly led to the formation of a tiger squad to explore what LLMs would mean for us at BenchSci.

The early agentic conversational prototype shown at a company all-hands

Within weeks, we shipped the first result: an AI-generated summary I designed for the legacy platform. I ensured that citations were visible, which was table stakes for scientists who needed a clear path to the evidence itself.

I designed our first LLM feature, an AI summary that shipped in the legacy platform. Inline citations helped to build trust and prove out the technology in a real use case.

Early AI guidelines I created as part of the first LLM feature. These would guide future features.

It became the most engaged feature in the platform. That signal gave us momentum going into the full pivot that restructured our company into value streams.

My role after the pivot

Each value stream was responsible for a surface area across the preclinical drug discovery workflow. I owned the Experiment Validation value stream which included two existing products (Reagent Selection and Architect), plus the open-ended problem of defining what experiment validation should look like on a platform that didn't yet exist.

On Experiment Validation I was the sole designer, working with three PMs, rotating engineering pods, and no dedicated UXR. Commercial pressure was also growing: enterprise accounts were watching our GenAI response closely, and renewals were on the horizon.

The challenge: evolve legacy products and extend their value into a new GenAI experience simultaneously.

Part 1 — Building the shared frame

As our team formed around an ambiguous surface area, I kept asking what the challenge actually meant for us. To address this, I prioritized creating a shared mental model for the product area and our team. Without one, we'd have no principled answer to which workflows to support, which legacy features had parity worth preserving, or what "experiment validation" even meant as a design surface.

I began by revisiting the assumptions behind our legacy platform and products: the view that scientific workflows moved through linear, gated stages. This was true of the drug discovery project pipeline, but that model missed key parts of the scientists' workflow. Despite that, it was a useful model that had shaped the IA, feature scoping, and how teams talked about priorities.

To gain a more accurate picture of the scientists' journey in experiment validation, I revisited previous research and ran user interviews with internal and external scientists to test our old mental models and understand how scientists actually engaged in experiment validation beyond our point solutions.

The assumption

Scientists work linearly

01 Create hypothesis
02 Plan experiment
03 Select reagents
04 Run experiment
05 Interpret results

Our legacy products were built on this model — discrete, gated stages.

The reality

Scientists work recursively

Build context · Plan protocols · Run experiment · Analyze results · Align & decide

10 interviews, along with previous research, revealed a recursive, fluid process, with scientists revisiting questions as evidence emerged.

I discovered that scientists moved through a web of interdependent decisions, with outputs feeding back into earlier steps and workflows connecting to things that happened entirely outside our platform. Scientists worked more fluidly and iteratively than we had initially assumed. This meant that linear journeys, or the view of the LLM as a single "workflow" tool, would be limited in their ability to address scientists' needs. However, it also presented an opportunity to surface clearer outputs to scientists based on the context they provide.

To share this insight, I built a new user journey mapping the discrete steps scientists took, the context each step depended on, the decision each step produced, and where each step connected outside the platform.

The interactive journey map, originally built just for my team but adopted by other product and commercial partners as a shared decision frame

Originally, I struggled with how to present it. Linear journey maps and static diagrams had a certain degree of readability; the non-linear, free-flowing reality required more understanding. I decided to make an interactive rather than static journey map, where teams could drill into relevant steps, filter by workflow, and build their own mental model as they explored rather than being handed one. A static diagram typically gets presented once and filed. My explorable one got used repeatedly by people who weren't in the room when it was made.

The map spread beyond product and design without being pushed. Commercial partners used it to frame the product area for enterprise clients. Three product teams adopted the journey framework I created.

For our team, it became a key artifact that answered which legacy areas we were covering, which we weren't, and what "value parity" meant when the interaction paradigm was changing entirely and feature parity wasn't possible.

Most importantly, this journey map helped us focus on a specific area first rather than try to tackle the entire surface area. We decided this area would be scientific methods: there was commercial interest, we had some data modelling present, and it represented all the "pieces" of a scientific experiment (materials, procedure, results).

Part 2 — When the obvious answer was wrong

With a clearer definition of what we were building, a second question emerged: what kind of interface do scientists actually need?

The company's GenAI direction was pointing us toward a chat-based "answer engine." Market momentum behind various chatbots reinforced it. Scientists were skeptical of the interaction model, and I shared their skepticism. A chat interface tends to be directional by design: the AI decides what's relevant and presents it to you. For scientists making costly decisions in their multi-year drug research, ceding that navigation to an AI introduces a risk they cannot accept, nor accept accountability for.

To answer this, I partnered with a user researcher from another value stream to validate the chat-first versus traditional GUI direction. We ran interviews with 15 internal and external scientists. Rather than testing one end-to-end prototype, I designed the study to isolate variations at each touchpoint. Isolation let us pinpoint where different interaction models succeeded or failed, and the variations allowed us to uncover qualitative feedback more rapidly.

Pure conversational
Hybrid — chat + table
Hybrid — app first
Pure app / table

The interviews yielded many insights, which we presented to the head of product and other stakeholders. In particular, we highlighted three key findings:

  1. Scientists needed agency over the answers. In a pure chat interface, the AI decides what you see. We could prompt it directionally, but it would still produce its own probabilistic response. This limited the possible depth and breadth a scientist could get. Scientists making consequential decisions need to navigate the evidence themselves and form their own judgments. The scientists in the study were comfortable with AI assistance, but they were uncomfortable with an AI positioned between them and the data that just gave them "answers".

  2. Trust required visible evidence, not just citations. An AI summary with citations attached wasn't enough. Scientists wanted to see the underlying data, not just a label pointing to it. External LLMs may have influenced scientists' perceptions of citation reliability. Casual consumers may notice this less, since most skip visiting each citation; scientists, by contrast, read every source. Surprisingly, it was the visibility of the evidence upfront that built trust in the citations.

  3. Efficiency meant getting to a confident decision faster. Inside the platform, scientists wanted to scan, not read. The unit of value was the fastest path to the specific piece of evidence that would let them make a call. Tables beat text, not only because they surface multiple attributes at once, but because they enable comparison and navigation. Additionally, scientific figures communicate method quality in ways a text description cannot.

This meant we needed two modes for two different cognitive needs: a chat for directed retrieval and freeform entry, which got scientists to relevant evidence more quickly; and a table for analytical navigation, letting scientists evaluate, compare, and decide on what was important to them. This research gave not only our team but the other product teams the confidence to go down a hybrid agentic direction.

Part 3 — The design

As I was designing, early versions kept surfacing the same friction: scientists either lost sight of the evidence while talking to the AI, or lost access to the AI once they were deep in the table. Getting both surfaces to coexist without hurting the other took longer than either component did individually, due both to the technical complexity and to the nuance of where the locus of focus sat.

Scientists needed to be able to talk to the AI and look at evidence simultaneously. This meant that inline tables or lengthy conversational responses as the primary focus were the wrong direction; we needed to give scientists the evidence as a second surface of focus. The left-right split followed from the constraint: orientation on the left, analytical work on the right. Both surfaces stay persistent, so follow-up questions don't require leaving the evidence. This direction was further supported by what I observed in the market.

The hybrid interface layout
1 Agentic chat that can invoke tools, call functions, and maintain context across turns (see the sketch after this list)
2 Minimal responses: the chat directs, the table delivers, aiming to reduce latency with parallel processing
3 Evidence table in a familiar format for scientists' existing research stack
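To make the division of labour concrete, here is a minimal TypeScript sketch of what a "chat directs, the table delivers" contract could look like. The names (EvidenceRow, AgentTurn, runEvidenceSearch) are illustrative assumptions, not BenchSci's actual API; the point is the shape of a turn — a short summary, inline citation markers, and a table artifact — and the parallel processing idea, where the summary and the structured retrieval resolve concurrently so the table isn't blocked behind the chat.

```typescript
// Hypothetical sketch, not the production implementation.

interface EvidenceRow {
  publicationId: string;   // source publication behind the evidence
  method: string;          // experimental method, e.g. "Western blot"
  reagents: string[];      // reagents used in the experiment
  figureIds: string[];     // figures available for visual comparison
}

interface AgentTurn {
  summary: string;                                    // kept to a few sentences
  citations: { marker: number; rowIndex: number }[];  // [n] markers mapped to table rows
  table: EvidenceRow[];                               // the artifact scientists navigate themselves
}

// Placeholder tool calls, standing in for the LLM call and the structured retrieval.
declare function draftShortSummary(query: string): Promise<string>;
declare function queryEvidenceIndex(query: string): Promise<EvidenceRow[]>;

// Chat generation and evidence retrieval run in parallel to reduce perceived latency.
async function runEvidenceSearch(query: string): Promise<AgentTurn> {
  const [summary, table] = await Promise.all([
    draftShortSummary(query),
    queryEvidenceIndex(query),
  ]);
  // Naive citation linking: one marker per row, in order (illustrative only).
  const citations = table.map((_, rowIndex) => ({ marker: rowIndex + 1, rowIndex }));
  return { summary, citations, table };
}
```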

The decision to have two possible areas of focus also affected how the AI communicated within that layout. Longer chat responses reduced scientists' engagement with the table. In testing, scientists would often wait for the chat to finish its lengthy response and then attempt to explore the citations before abandoning. This often meant missing evidence that would have changed their experience and research.

Shorter responses accompanied by an artifact (like a table) consistently drove scientists into the experience more meaningfully. There's also a perception effect: three paragraphs at the same latency as three sentences feels slower. The summary stayed collapsed by default for related reasons — a streaming summary growing in real time pushes the table down while pulling attention through motion, at exactly the moment scientists should be reading results. Collapsed, the table is scientists' first impression of a result set. They form their own read before encountering the AI's.

The table decision was a bit controversial compared to our design system, which had been moving toward cards. Cards worked well for workflows where each step is evaluated in isolation, such as a project workflow diagram. But for scientists in experiment validation, the work was often about comparing larger sets of data and narrowing down possibilities. When we tested cards, scientists would always ask, "can I see these side by side? or in a table?" That question never came up in our more data-dense views. The layering of figures as a pivot was done for a similar reason: the figures were in cards to allow for comparison, but with far greater density.

This combination of views along with data parity gave scientists the ability to quickly narrow down the information they cared about. It also consequentially determined what we'd need from our extraction pipeline and data model.

The evidence table with collapsed AI summary
1 Collapsed AI summary prevents screen drift
2 Visual pivot allows scientists to switch between table rows and figure comparison
3 Table data controls let scientists explore the data through known mental models
4 Figure availability at a glance to quickly assess evidence quality
Figure grid view
1 Grid-to-table parity lets scientists view the same results through a different lens
2 Scientists can quickly search and filter using the table schema
3 Figure grid allows for rapid visual comparison of method and reagent results
AI summary with inline citations
1 Inline citations are numbered and linked to the evidence
2 Basic feedback and utility tools like copy and save match expectations from other tools
Citation reference panel
1 A common evidence-list display pattern across the platform builds trust through familiarity
2 The evidence card includes qualifying contextual metadata to help scientists understand the evidence
Row-level context menu
1 Context menu reduces mental load by showing actions only when they're needed
2 Users can curate and explore by pinning rows, comparing selections, or asking a follow-up question without leaving the evidence view

Save Publication Figure modal: project-based curation flow — scientists organising evidence into shared workspaces without leaving the research view.

Part 4 — Reducing time to market by shipping smaller

We had a full agentic chat design vision, but many cross-team dependencies and complexities prevented us from shipping quickly to learn from the market: system prompt architecture, agentic framework design, platform-wide data schema work, and a new design system. All of these were still in progress, and waiting meant possibly years of lost time, with active enterprise conversations about Experiment Validation already underway.

Originally we saw two options: wait and build the full platform (right long-term outcome, wrong timeline), or ship partial features piecemeal as inline chat responses. Neither was ideal or reliable, as orchestration and guardrails were a major cross-team challenge while every value stream was trying to redefine its space at the same time.

I started to question whether we had a dependency that didn't actually need to exist. The agentic infrastructure was genuinely unready. But when we examined what enterprise accounts actually needed to see, the answer wasn't the agentic experience, but rather the data underneath it and the vision of what would be coming. The quality and precision of our methods data was the primary differentiator and moat. Some of it was already in legacy products, on an old data model. A composite data model pulling from both legacy and new infrastructure would let us ship that value immediately, independent of everything else still being built.
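As a rough illustration of that composite approach, the TypeScript sketch below shows one way records from the legacy store and the new extraction pipeline could be normalised into a single shape for the UI. The record types and the toComposite helper are hypothetical assumptions for the sake of the example, not the actual BenchSci data model.

```typescript
// Illustrative only: a composite model that exposes one shape to the interface
// regardless of whether the record came from legacy data or the new pipeline.

interface LegacyMethodRecord {
  id: string;
  reagent: string;
  assayType: string;
}

interface ExtractedMethodRecord {
  publicationId: string;
  procedureText: string;
  figureRefs: string[];
}

interface CompositeMethodRecord {
  id: string;
  reagent?: string;
  assayType?: string;
  procedureText?: string;
  figureRefs: string[];
  source: "legacy" | "extraction"; // provenance kept visible for evidence display
}

// Merge both sources into the single shape the MVP table reads from,
// independent of the platform-wide unified data model still being built.
function toComposite(
  legacy: LegacyMethodRecord[],
  extracted: ExtractedMethodRecord[],
): CompositeMethodRecord[] {
  return [
    ...legacy.map((r) => ({
      id: r.id,
      reagent: r.reagent,
      assayType: r.assayType,
      figureRefs: [],
      source: "legacy" as const,
    })),
    ...extracted.map((r) => ({
      id: r.publicationId,
      procedureText: r.procedureText,
      figureRefs: r.figureRefs,
      source: "extraction" as const,
    })),
  ];
}
```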

The standalone MVP that resulted had a dedicated entry point, minimal search constraints, and the composite data model as its core. The focused scope was a UX decision: a product with a single clear purpose is easier to learn, easier to evaluate, and produces cleaner feedback. It also meant we could introduce the product to a skeptical user base with an experience that was complete and correct for its scope — rather than incomplete for a larger one. Scientists form strong opinions about tools quickly and revise them slowly.

The first stand-alone experience, which let us validate data quality, interaction, and direction.

The MVP resonated with users and opened the commercial conversations we needed. The right MVP scope isn't the minimum feature set — it's the minimum scope that delivers the core value cleanly. For this product, that was "high-quality methods data, easy to reach."

Results

Enterprise partnerships were publicly announced during my time leading design on Experiment Validation, including an eight-figure deal with Thermo Fisher. My contribution was the design strategy that outlined what experiment validation was and the data we'd need to support it. The journey map gave product and commercial teams a shared internal frame for alignment and became an artifact that resolved what "experiment validation" covered as a product area. The standalone MVP demonstrated our data and gave those conversations something working to put in front of partners before the full agentic experience existed.

100K+ downmarket users reached. The enterprise partnerships opened the door to more users specifically interested in using it for experiment validation. My contribution was designs that served what scientists needed, and a standalone MVP that enabled usage, engagement, and further validation as we built a fuller agentic chat.

3+ teams adopting the journey framework. The journey framework's organic adoption was a surprise to me, but it signaled that both the mental model and the interactive format were useful. This explorable journey map gets used repeatedly by people who weren't in the room when it was made, and continues to show up in product and commercial conversations.

Reflections

As scientists engaged with this, I kept thinking about how, for users like ours, AI trust is tied to agency and accuracy. If the AI hallucinated, that would be extremely consequential, but the scientist having the ability to navigate on their own to remedy it was very valuable. Design that aims to be too agentic and takes away scientists' ability to do their own work will fail with this expert user base.

What I'd do differently:

  1. Align earlier: Earlier alignment on data infrastructure constraints would have turned the MVP strategy from a reactive decision into a planned one. We initially assumed the data and scale we needed would work in the agentic experience we were designing, but the company-wide unified data model changes proved slower and more complex. We arrived at the right answer for the MVP as we approached the build, rather than having planned for it. Earlier infrastructure conversations would have changed the scope of what we prototyped and tested upstream, and probably would have let us both ship faster and help prioritize and influence the data model.

  2. Start smaller: We began with the full conversational scope, when we could have started with just data definition. I learned that LLM and AI experiences are deeply dependent on the underlying data and its structure. By starting smaller, we could've focused on the data pipeline first, which would've let us find focus more quickly.

  3. More intentional prototyping: During my explorations, I ended up prototyping my own data extraction pipeline. This exploration was primarily to help me understand how the data moves, but it ended up becoming a short distraction. The prototype that catalyzed AI at BenchSci was the opposite: it was focused on providing a visual experience for the LLM conversations. In the future, I'd make sure all prototypes are more focused.

  4. Clarify AI: As LLMs became more widely accepted and used, different colleagues had different expectations and mental models around what we could or should do. In the future, I'd spend more time articulating these clearly before diving in and assuming we were all on the same page.

Part 5 — Going beyond an agentic chat

As I observed scientists, I noticed a surprising behaviour: they often asked questions about the table itself. This led to a hypothesis around dynamic tables that would differentiate us from other generic chat products and align more closely with scientists' natural workflows.
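A minimal sketch of what "table-aware" could mean in practice, assuming the visible table state (columns, filters, selections) is serialised into the chat's context on each turn. The names below are illustrative assumptions, not the production implementation.

```typescript
// Hypothetical table-aware chat: the scientist's current view travels with the question.

interface TableState {
  columns: string[];                      // visible columns
  activeFilters: Record<string, string>;  // filters the scientist has applied
  selectedRowIds: string[];               // rows pinned or selected for comparison
}

// Placeholder for the underlying chat model call.
declare function callChatModel(prompt: string): Promise<string>;

// Questions like "why do these two rows differ?" resolve against what the
// scientist is actually looking at, not just the original query.
async function askAboutTable(question: string, table: TableState): Promise<string> {
  const context = [
    `Visible columns: ${table.columns.join(", ")}`,
    `Active filters: ${JSON.stringify(table.activeFilters)}`,
    `Selected rows: ${table.selectedRowIds.join(", ")}`,
  ].join("\n");
  return callChatModel(`${context}\n\nQuestion: ${question}`);
}
```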

Prototype for agentic table, with context-aware chat.

If you're curious check out the agentic table prototype case study.