Building with AI in 2024: What's in and what's out

Mar 29, 2024

A new engineering approach to building, testing, and deploying AI features in your product.

Introduction

It's easy to build a ChatGPT wrapper and call it a day. The hard part is getting users excited about it. We want engineers to do even more with AI and enable features that wow users by capturing the real power of this new technology.

Last year we started building an AI-powered Slack app and quickly realized how much work it takes to design, build, test, and deploy a high-quality AI tool. As easy as it is to use current AI tools such as ChatGPT, Perplexity, and Tome, building tools like them is far more complex. So we created Plumb, an end-to-end solution that provides SaaS teams with a simple way to build complex AI features.

We define a complex AI feature as any pipeline that uses more than one prompt to process its input and decide what happens in the next step(s) of that pipeline. We realize that not everyone is aware of how much easier and faster it can be to build and launch AI features, so we've outlined a few of the ways Plumb changed our approach:

👍 In: Step-level observability of pipelines

👎 Out: Reading a wall of code that only engineers can understand

Building with AI in raw code creates a lack of visibility into the results of each step of a pipeline. Plumb pipelines are visual and declarative. This means engineers and non-engineers can understand a pipeline in 30 seconds, instead of the 30 minutes it would take when looking at code.

This also means that prompt engineers can experiment with prompts without needing to go through an engineer. Prompt engineers can test prompts at a specific step in the pipeline, see the results live, and implement those changes in the same place.

Previously, testing a new prompt meant that an engineer would have to set aside an entire afternoon just to implement the changes and resolve any resulting conflicts in the code. Creating Plumb was a game changer for us. It saved us hundreds of engineering hours and empowered our prompt engineer (who is also a product manager) to iterate on prompts directly in the pipeline, without touching code.

But the best part? It freed up our engineering team to work on more critical projects that require creative problem-solving and provide more value to users.

👍 In: A visual WYSIWYG overview of the pipeline that is the pipeline

👎 Out: Manually creating and maintaining a pipeline diagram to match with the code

Plumb provides a node-based, drag-and-drop visual pipeline designer. But that’s a mouthful, so we call it a WYSIWYG pipeline. With a WYSIWYG pipeline, engineers can be confident that what they see is what they get in building, testing, and production. And product teams can quickly put together prototypes to assess the feasibility of new ideas.

Even better? A visual WYSIWYG approach to building with AI quickly creates shared understanding between engineering and non-engineering teams. You can easily explain a pipeline to non-technical prompt engineers, show work in progress to your CEO, and ease the cognitive burden of remembering how a pipeline works step by step.

👍 In: Knowing that you won’t break a pipeline when making changes.

👎 Out: Breaking a pipeline every time you iterate on it.

With Plumb, developers know with confidence that a step in the pipeline can be iterated on and modified easily with type safety intact. This is because Plumb uses static type inference in each step of a pipeline.

For non-technical folks, we’ve been using the metaphor of Mad Libs internally to describe why static type inference is important. It’s the difference between doing Mad Libs with and without a label under each blank spot. A label gives context about which type of word players should come up with to fill in the blanks. Removing the label creates more guesswork and confusion for players — they can probably use context to fill it in, but do they know with certainty? Nope.

A quick exercise 🤓

Try doing the examples below, first without the labels and then with the labels. Which one was easier to do? Was the quality of your experience better with or without the labels?


Without labels (aka without static type inference).


With labels (aka with static type inference).

Not only does Plumb give engineers peace of mind knowing that a pipeline won’t break with every change, it also increases the pipeline build quality. This is especially useful when experimenting with new ideas while trying to avoid making changes to the code.
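For engineers, the Mad Libs labels can be sketched in a few lines of Python. This is only an illustration of the idea, not how Plumb implements it: `Step`, `check_pipeline`, and the three stand-in functions are hypothetical names, and the check runs when the pipeline is assembled by reading type annotations, which is a simplification of true static type inference.

```python
from dataclasses import dataclass
from typing import Callable, get_type_hints


@dataclass(frozen=True)
class Step:
    """Hypothetical pipeline step: a name plus a fully annotated one-argument function."""
    name: str
    fn: Callable


def check_pipeline(steps: list[Step]) -> None:
    """Raise TypeError if a step's output type differs from the next step's input type."""
    for prev, nxt in zip(steps, steps[1:]):
        produced = get_type_hints(prev.fn)["return"]
        hints = get_type_hints(nxt.fn)
        # The first non-return annotation is the step's single input.
        expected = next(t for param, t in hints.items() if param != "return")
        if produced != expected:
            raise TypeError(
                f"{prev.name} produces {produced!r}, "
                f"but {nxt.name} expects {expected!r}"
            )


# Stand-ins for prompt steps (no model calls, just labeled blanks).
def classify(message: str) -> bool:
    return "deploy" in message.lower()


def summarize(is_professional: bool) -> str:
    return "work item" if is_professional else "skipped"


def count_words(text: str) -> int:
    return len(text.split())


# The labels line up: str -> bool -> str.
check_pipeline([Step("classify", classify), Step("summarize", summarize)])
```

Swapping `summarize` for `count_words` makes `check_pipeline` raise a `TypeError` before anything runs, because `classify` produces a `bool` and `count_words` expects a `str`. That is the "label under the blank" doing its job.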

👍 In: Building a pipeline at the same time as you’re designing it.

👎 Out: Designing a pipeline, then coding it separately.

We’ve talked about the benefit of using a visual WYSIWYG pipeline and the ability to iterate on each step of a pipeline for non-technical roles. Now let’s talk about why this is great for engineers beyond saving time and energy.

Designing and building a pipeline previously required engineers to constantly switch context between:

  1. The optimal way to design a pipeline — what are the actual steps of the pipeline needed to get to a particular output? What order of steps makes the most sense?

  2. The optimal way to build a pipeline — what’s the best way to build it with code?

Coding often interrupts the creative flow of designing a pipeline, but engineers don’t have to worry about that in Plumb. Rather than translating a design into code, developers just drag and drop working components of a pipeline that don’t need to be tested or coded from scratch. Re-ordering the steps of a pipeline or changing the model behind a prompt takes seconds, minimizing friction and interruption to an engineer’s workflow. With the speed at which AI moves, the ability to quickly experiment with and validate ideas makes a huge difference.

👍 In: Running a pipeline by calling an API.

👎 Out: Manually implementing an AI pipeline with code.

Manual implementation is prone to error and there’s no guarantee that prompt performance in production will match the prompt performance in testing.

Plumb provides a built-in pipeline processor so that engineers can run a pipeline by simply calling an API. Engineers can run pipelines end-to-end and see step-by-step results in testing and know that it will match production performance, making it much easier to spot and debug errors in production (as well as easing the cognitive burden of engineering).
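To give a feel for what "calling an API" could look like from application code, here is a rough sketch that only builds the HTTP request. Everything specific in it is an assumption made for illustration: the `api.example.com` base URL, the `/pipelines/{id}/run` path, the `inputs` payload shape, and the bearer-token header are placeholders, not Plumb's actual API. Check the real documentation for the endpoint and authentication scheme.

```python
import json
import urllib.request

# Placeholder base URL, not a real Plumb endpoint.
BASE_URL = "https://api.example.com/v1"


def build_run_request(pipeline_id: str, inputs: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks a hosted pipeline to run."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/pipelines/{pipeline_id}/run",  # hypothetical path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # hypothetical auth scheme
            "Content-Type": "application/json",
        },
    )


request = build_run_request("weekly-snapshot", {"channel": "#general"}, "test-key")
print(request.get_method(), request.full_url)
```

Sending it would be a single `urllib.request.urlopen(request)` call against the real endpoint. The point is that the application holds one small request, while the pipeline itself lives and runs elsewhere.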

Step-level testing gives you the ability to test and analyze the performance and output of each individual prompt in the pipeline. Here’s an example from our experience building Supermanage that illustrates why someone might care about this:

  • Supermanage is an AI-powered Slack app that analyzes messages in public channels and creates a snapshot that covers a direct report’s contributions, challenges, and sentiment over the course of a week.

  • Problem: The contributions section is a mess. Lots of public channels are littered with messages about people’s personal lives. Our team needed to figure out how to filter that content out of work-related sections of the snapshot.

  • Solving the problem without step-level testing: The only way we could test the prompts used in Supermanage was by running the entire pipeline end-to-end. However, we couldn’t check the output of Prompt #1 or Prompt #2 within the pipeline to see how it affected each step. The only artifact we could analyze was the final snapshot. We made guesses about which prompt wasn’t working and hoped that the changes we implemented would result in more accurate snapshots — sometimes this worked, but most of the time it didn’t. But worst of all, it left our team feeling extremely frustrated, confused, and shocked that there wasn’t a better way to do this.

  • Solving the problem with step-level testing: We tested Prompt #1 to see how it classified each Slack message as either personal or professional. Next, we tested Prompt #2 to see how it classified the professional messages as a task, request, or project that had been completed. Now we could see the output of each step and know concretely which prompt was analyzing messages incorrectly: Prompt #1 was classifying many personal messages as professional. We saw how the output of Prompt #1 affected the output of Prompt #2 and immediately identified what needed to be fixed.

Step-level testing enables you to debug complex AI pipelines without touching code.
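The Supermanage story can be sketched as a toy two-step pipeline that records the output of every step. This is a simplified illustration, not our production code: `run_with_trace`, `StepResult`, and the keyword-matching functions standing in for Prompt #1 and Prompt #2 are all hypothetical, and the stand-in for Prompt #1 is deliberately naive so the bug is visible.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    """Output captured from one step of a run."""
    step: str
    output: list[str]


def run_with_trace(steps: list[tuple[str, Callable]], value: list[str]) -> list[StepResult]:
    """Run steps in order, keeping each intermediate output instead of only the last."""
    trace = []
    for name, fn in steps:
        value = fn(value)
        trace.append(StepResult(name, value))
    return trace


def prompt_1_keep_professional(messages: list[str]) -> list[str]:
    """Stand-in for Prompt #1. Naive on purpose: any mention of 'ship' counts as work."""
    return [m for m in messages if "ship" in m.lower()]


def prompt_2_extract_completed(messages: list[str]) -> list[str]:
    """Stand-in for Prompt #2: keep messages that read as completed work."""
    return [m for m in messages if "shipped" in m.lower()]


messages = [
    "Shipped the billing fix",
    "My new couch finally shipped!",
    "Anyone up for lunch?",
]
trace = run_with_trace(
    [("prompt_1", prompt_1_keep_professional), ("prompt_2", prompt_2_extract_completed)],
    messages,
)
for result in trace:
    print(result.step, result.output)
```

Looking only at the final output, the couch message could be blamed on either step. The trace shows it already slipped through `prompt_1`, so that is the prompt to fix, and it can be retested on its own without rerunning the whole pipeline.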


Conclusion

Plumb lets you make cool shit with confidence. Engineers can feel assured that Plumb’s WYSIWYG pipeline, step-level observability and testing, static type inference, node-based editor, and pipeline processor will result in high-quality AI features that their non-technical peers can collaborate on with them.