Beautiful AI-based Products

If your product has a UX element that aligns user interactions with the goals of your AI models, then you have a beautiful AI-based product.

Think of Midjourney. You write a prompt, get 4 generated images. Midjourney's goal is to generate the best possible images given a prompt. The user's desire is to get the best possible image from a prompt. There is 100% alignment between the product's and the user's goals. And you don't need to give advanced instructions to Midjourney users on selecting the best picture. Their human intuition guides them effortlessly. The result is invaluable data for Midjourney to train their models.

Think of Netflix. Occasionally, while you're watching Netflix, you will be asked, 'Are you still watching X?' Netflix understands how crucial this subtle piece of information is for their recommender systems and the user experience of the product. The recommender system needs to know if you are actually watching to determine what to recommend next. And if you fell asleep, it's convenient that Netflix can predict this and automatically stop the series for you, so you don't have to search for where you stopped watching. When you are prompted with 'Are you still watching X?', you as a user have every interest in answering, and in doing so you give Netflix almost exactly the information it needs to evaluate its system.

More examples of products that do this are Google Search and every social media feed. Huge companies are built by integrating UX and AI to craft superior products. Yet, not many consider AI and UX in this integrated manner. It should be a consideration in every UX decision made.

Prompting Patterns: The Clarification Pattern

The more I use ChatGPT and develop software using LLM APIs, the more I realize that context is essential for LLMs to provide high-quality answers. When I use ChatGPT and receive unsatisfactory answers, it's typically due to a lack of information about the problem I'm presenting or my current situation. I often notice that I might be ambiguous about the task I want ChatGPT to solve, or ChatGPT perceives the issue in a manner I hadn't anticipated. However, I've observed that by adopting a simple pattern, I can significantly reduce these challenges, consistently leading to more accurate responses.

The pattern is as follows:

  1. Me: I instruct ChatGPT to perform a task. I tell it not to respond immediately but to ask clarifying questions if any aspect of my instruction is unclear.
  2. ChatGPT: Asks clarifying questions.
  3. Me: I answer the questions and tell it again not to execute the instruction but to ask further clarifying questions if any part of my answers is unclear.
  4. ChatGPT: It does one of two things.
    a) Asks additional clarifying questions. If this happens, return to step 3.
    b) Indicates it has no further questions. If this is the case, proceed to step 5.
  5. Me: I give the command to execute the instruction.
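The loop above can be sketched as code. Here is a minimal, runnable sketch of the Clarification Pattern, where `ask_llm` is a stand-in for a real chat API call (it is stubbed here so the control flow runs on its own, and the `NO_QUESTIONS` sentinel is my own convention, not part of any API):

```python
# A minimal sketch of the Clarification Pattern as a conversation loop.
# `ask_llm` is a stub standing in for a real chat-completion API call.

NO_QUESTIONS = "NO_QUESTIONS"

def ask_llm(messages):
    """Stub: a real implementation would call a chat completion API."""
    # Pretend the model asks one clarifying question, then is satisfied.
    asked = sum(1 for m in messages if m["role"] == "assistant")
    return "Which output format do you want?" if asked == 0 else NO_QUESTIONS

def clarification_pattern(task, answer_question):
    # Step 1: give the task, but ask for clarifying questions first.
    messages = [{
        "role": "user",
        "content": (
            f"{task}\n\nDo not perform the task yet. If anything is unclear, "
            f"ask clarifying questions. Reply '{NO_QUESTIONS}' when you have none."
        ),
    }]
    while True:
        # Steps 2 and 4: the model either asks questions or signals it is done.
        reply = ask_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply == NO_QUESTIONS:
            break
        # Step 3: answer the question and ask again for further clarifications.
        messages.append({"role": "user", "content": answer_question(reply)})
    # Step 5: give the command to execute the instruction.
    messages.append({"role": "user", "content": "Now perform the task."})
    return ask_llm(messages), messages
```

With a real API behind `ask_llm`, `answer_question` would typically be a human in the loop rather than a function.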

I call this the "Clarification Pattern." Recognizing this approach shifted my perspective from viewing prompt engineering solely as individual prompts to thinking in terms of human-AI conversations. Through these dialogues, I can build valuable context by clarifying ambiguities in both my understanding and that of ChatGPT, thus providing ChatGPT with the optimal conditions to deliver an excellent response.

Text Classifiers are an Underrated Application of LLMs

Before LLMs really became a thing, getting up and running with a text classifier for a non-standard problem from scratch, including the annotation of a dataset for training, would probably take at least 3 weeks of work hours. That amounts to 7,200 minutes. Today, getting up and running with a classifier using LLMs requires only writing a prompt, which takes about a minute.

That's a 7,200x productivity gain in the initial process of working with text classifiers.

One thing to note, however, is that in the 1-minute prompt scenario, you have collected zero data and therefore have nothing to measure your classifier's performance against. However, since you have a classifier, you can annotate much more efficiently using an active learning approach, and you have 7,199 minutes to knock yourself out with evaluating your classifier.
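The "one-minute classifier" really is just a prompt. Here is a hedged sketch of what that looks like; the labels and the guard against off-list replies are my own illustrative choices, and `call_llm` is a placeholder for whatever LLM API you use:

```python
# A sketch of the "one-minute classifier": the whole classifier is a prompt.
# `call_llm` is a placeholder for a real LLM API call; the labels below
# are made up for illustration.

LABELS = ["complaint", "question", "praise"]

def build_prompt(text):
    return (
        "Classify the customer message into exactly one of these labels: "
        + ", ".join(LABELS)
        + ".\nReply with the label only.\n\nMessage: "
        + text
    )

def classify(text, call_llm):
    label = call_llm(build_prompt(text)).strip().lower()
    # Guard against replies that are not one of the allowed labels.
    return label if label in LABELS else None
```

The returned `None` for off-list replies is exactly the kind of case an active learning loop would surface for annotation first.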

Everybody talks about chatbots and agents as the hot new thing, but honestly, a 7,200x productivity gain in text classifier development is also pretty huge!

Thought Debugging

Previously, tuning text classifiers required annotated datasets. This process entailed splitting the datasets into training and test sets, fine-tuning the model, and measuring its performance. Often, improving accuracy meant analyzing incorrect predictions to hypothesize about what the model failed to understand about the problem. Solutions for improving performance could involve adding more annotations, tweaking the annotation protocol, or adjusting preprocessing steps.

However, with the rise of Large Language Models (LLMs), the focus has shifted towards crafting effective prompts rather than constructing datasets. If a model doesn't respond accurately to a prompt, you refine the prompt to accommodate potential misunderstandings. A significant advantage of LLMs is their ability to explain the reasoning behind their predictions: you can interactively probe the model's understanding and refine the prompt further. Because the models can express their thought processes, you can not only improve their performance but also diagnose and correct their reasoning, a technique that can be termed "Thought Debugging".
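In practice, Thought Debugging just means asking for the reasoning alongside the prediction, so a wrong label comes with a debuggable trace. A minimal sketch, with illustrative (not canonical) prompt wording and a plain-text output format:

```python
# A sketch of "Thought Debugging": ask the model for its reasoning and its
# prediction, then parse both so wrong labels can be diagnosed.
# The prompt wording and the THOUGHTS/CLASS format are illustrative choices.

def thought_debug_prompt(text, labels):
    return (
        f"Classify the text below as one of: {', '.join(labels)}.\n"
        "First write your reasoning on a line starting with 'THOUGHTS:'.\n"
        "Then write the label on a line starting with 'CLASS:'.\n\n"
        f"Text: {text}"
    )

def parse_reply(reply):
    """Extract (thoughts, label) from the model's reply, or None for a field it omitted."""
    thoughts, label = None, None
    for line in reply.splitlines():
        if line.startswith("THOUGHTS:"):
            thoughts = line[len("THOUGHTS:"):].strip()
        elif line.startswith("CLASS:"):
            label = line[len("CLASS:"):].strip()
    return thoughts, label
```

When the label is wrong, the `thoughts` field tells you whether the model misread the text or misread your instructions, which is exactly what you adjust the prompt against.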

Right and Wrong Things to Work On

It seems like there are two kinds of things you can work on: right and wrong things. The right things bring you closer to your goals, the wrong things take you further away from your goals or nowhere at all. I firmly believe that one's ability to choose the right things to work on largely determines one's success.

Right and wrong things to work on must always be viewed relative to a goal. One thing may be the right thing to work on to achieve goal A, but the wrong one if you want to achieve goal B.

As a person, I am very enthusiastic and full of ideas. These may seem like two positive attributes, but the combination often leads to a lack of focus. For there are many good ideas, but not all of them lead towards the same goal. If you are too easily seduced by your abundance of ideas, you end up unfocused and make very inefficient progress towards your goal.

In my opinion, that's why it's important to spend time figuring out what your goals are. It sounds simple, but personally, I find it much easier said than done! For what is the actual dream scenario if you think 5-10 years into the future? This question deserves a serious effort to answer, and regular revisiting to check whether your actions align with your long-term goals.

The clearer you are on your goals, the better you become at determining what the right and wrong things to work on are in relation to achieving your goals.

A Process for Building LLM Classifiers

Large language models (LLMs) can be prompt-engineered to solve a wide variety of tasks. While many consider chat the primary use case, LLMs can also be used as traditional text classifiers.

Before the rise of advanced generative text-to-text models, crafting a custom text classifier was a time-consuming process that required extensive data collection and annotation.

Nowadays, you can get your hands dirty with LLMs without worrying about annotating data. This is great as it saves you a lot of time. However, it also becomes tempting to bypass best practices for building robust machine learning applications. 

When there's no need to create a training dataset, the temptation of simply hand-tuning a prompt based on a few examples becomes strong. You might convince yourself that it will generalize to any data presented to it. The challenge is that without annotations to measure accuracy, or a method to assess your prompt, you can't determine its robustness once deployed.

In my recent work with LLMs, I have thought a lot about this and have developed a process that, in my experience, enables the construction of robust LLM classifiers. This method is not only more efficient but also more enjoyable to fine-tune compared to the old school way of doing it.

The following process will help you craft more robust and reliable LLM modules.

Step 1: Collect Dataset

Collect a raw, unannotated dataset representative of the data on which your classifier will be used in real-world scenarios. The dataset's size should provide the desired significance level when assessing the classifier, while remaining realistic for you to annotate and not exhausting your API call budget with OpenAI. Divide the dataset into validation and test subsets.
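Step 1 amounts to a shuffle and a split. A minimal sketch, where the 50/50 split fraction and the seed are my own assumptions; pick sizes that fit your annotation budget and API budget:

```python
# A sketch of step 1: shuffle a raw, unannotated dataset and split it into
# validation and test subsets. The 50/50 split is an illustrative default.
import random

def split_dataset(samples, val_fraction=0.5, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # seeded for reproducibility
    cut = int(len(samples) * val_fraction)
    return samples[:cut], samples[cut:]   # (validation, test)
```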

Step 2: Create Initial Prompt

Construct an initial prompt you believe will be effective. It should yield two outputs. The first output should articulate the model's thoughts when determining which class to assign to the input text. 

This will be useful for iterative improvements to the prompt, ensuring it aligns with the task. In accordance with the chain-of-thought method, this should improve performance and enhance explainability. The second output should be the class the LLM assigns to the input text.

The output format should look something like this:

{ "thoughts": <rationale behind classification here>, "class": <the class the model has classified the example as here> }

Test the prompt on a few dataset samples to get a feeling of the model's comprehension of the task. Dedicate some time to refining it for optimal results. You should be confident that the LLM has a reasonable understanding of the task.
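Since later steps depend on reading the two-field output reliably, it helps to parse it defensively. A sketch, assuming the JSON format shown above and allowing for the common case where the model wraps the JSON in extra text:

```python
# A sketch of parsing the {"thoughts": ..., "class": ...} output from step 2.
# Model replies sometimes wrap the JSON in extra prose, so we extract the
# outermost braces before parsing.
import json

def parse_classification(reply):
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        return None  # no JSON object at all
    try:
        obj = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None  # malformed JSON
    if "thoughts" in obj and "class" in obj:
        return obj
    return None  # JSON parsed but required fields are missing
```

Returning `None` on failure lets you count and inspect unparseable replies separately from misclassifications.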

Step 3: Run, Inspect and Annotate

Now run the hand-tuned prompt on the entire validation dataset. For reproducibility, set the temperature to 0. Review all classified samples. Where the LLM's classification is inaccurate, correct it and document the areas of misunderstanding. Use the thoughts output to understand its decision-making process.

During annotation, you'll almost certainly discover complexities and nuances in the problem you're trying to solve that you didn't initially think of. You will also likely discover ambiguities in the instructions you asked the LLM to follow, where you will have to be clearer about what you want it to do. In some cases, the limits of the LLM's understanding will also reveal themselves. Document these findings in an "annotation protocol" that outlines rules for managing edge cases.

Step 4: Measure Performance of Prompt

Upon completing step 3, you'll have an annotated validation dataset. Measure the prompt's predictive performance against it, e.g. as accuracy, to gain insight into how well it actually classifies.
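Step 4 is a straightforward comparison between the prompt's predictions and the annotations from step 3. As a minimal sketch, measuring accuracy:

```python
# A sketch of step 4: compare the prompt's predictions on the validation set
# against the annotations from step 3 and report accuracy.

def accuracy(predictions, annotations):
    assert len(predictions) == len(annotations), "one prediction per annotation"
    correct = sum(p == a for p, a in zip(predictions, annotations))
    return correct / len(annotations)
```

For imbalanced classes, per-class metrics are more informative than plain accuracy, but the mechanics are the same.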

Step 5: Adjust Prompt

After step 4, you'll have written notes detailing cases where the LLM misclassified data. From these, formulate a hypothesis about which prompt modifications could enhance its accuracy, and adjust the prompt in a way you believe will mitigate the errors.

Step 6: Iterate

After adjusting the prompt in step 5, run it on the validation dataset and measure its performance. Ideally, results should improve post-adjustment. Analyze incorrect classifications and take notes to understand the model's behavior. Repeat this process until you are satisfied with the prompt's performance or believe you have reached maximum performance.

Step 7: Measure Performance on Test Dataset

Now it's time to follow best practices, like the diligent and competent ML engineer you are: run the tuned prompt on the test set. Your test set isn't annotated yet, which presents a significant temptation to skip this step. But you know you have to do it! You will likely find that performance on the test dataset is a little worse. This is expected, because you have probably overfitted your prompt to the validation dataset.


Congratulations, you now have an LLM classifier to solve a problem for you! For now, this is the best process I have. If you know of a better approach, I would love to hear from you. Additionally, at the company where I work as an ML Engineer, we are constantly striving to crystallize our learnings into code. Specifically, we are developing a Python package called prompt-functions, which, in our experience, makes this process much smoother. We would love to continue the conversation on how to manage LLM applications, so please feel free to open an issue, send us a pull request or simply just reach out to me 🤗

The Human-to-Screen Ratio

I feel like I spend way too much time in front of a screen coding, compared to the time I spend interacting with people. When I talk to other people about this, some say the exact opposite. Perhaps it would be useful to define a metric for this. This way, people can measure it and strive to achieve their desired balance between screen and human interaction. We could call it the human-to-screen (H2S) metric.


H: time spent interacting with humans 
S: time spent in front of a screen

The ratio can then be defined as:

    H2S = H / S

  • If > 1, more time is spent interacting with humans than in front of a screen.
  • If = 1, equal time is spent on both activities.
  • If < 1, more time is spent in front of a screen than interacting with humans.
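Written out as code, with hypothetical numbers for a workday:

```python
# The H2S ratio from the definition above: hours with humans divided by
# hours in front of a screen. The sample workday numbers are made up.

def h2s(hours_with_humans, hours_at_screen):
    return hours_with_humans / hours_at_screen

ratio = h2s(2, 8)  # 2 hours of human interaction, 8 hours of screen time -> 0.25
```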

In my daily work, I estimate my H2S ratio to be roughly 0.25. Ideally, I'd prefer it to be around 2, or maybe even higher. Interestingly, when I discuss this with others, it seems the optimal H2S ratio varies among individuals.

What's your ideal human-to-screen ratio? Is it optimal for you, or do you wish for a higher or lower ratio?

Can ChatGPT Create New Knowledge? 👨‍🔬

The other day at work, we had a discussion about whether LLMs can create new knowledge 🤔

Every time you run into a question of the type:

"Can AI X?"

I often find it useful to ask the question:

"Can humans X?"

So in this case:

"Can humans create new knowledge?"

Savor those two questions when they stand side by side:

1. Can AI create new knowledge?

2. Can humans create new knowledge?

I don't know the answer to number 1, but I'm actually not sure about the answer to number 2 either 😅

When you think about it, the question is perhaps a bit more nuanced than it first appears. For isn't new knowledge something you experience from the real world?

And isn't it then unfair to LLMs that humans have far greater access to the real world 🤔

Far From Everyone Uses ChatGPT

I seriously use ChatGPT constantly when I work! I use it to:

  • write code
  • visualize data
  • explain code that others have written
  • test code
  • discuss best practices and architecture
  • learn to code
  • and much, much more

A rough estimate is that ChatGPT increases my productivity by 35% on average. For some tasks, ChatGPT boosts my productivity by several 1000%. There are even things I take on because I have ChatGPT that I otherwise wouldn't. And on top of that, I also write better code overall, because I have a sparring partner who is always online.

When I talk to my colleagues in the tech industry, far from all of them use ChatGPT. Considering how much value I get out of ChatGPT, I simply can't fathom that not all developers are using it heavily!

From my point of view, a huge untapped productivity potential is still lurking just around the corner!

ChatGPT Is Not a Database

Generative AI technology is developing at a rapid pace. However, advanced models like GPT-3.5 and GPT-4 face a challenge: they tend to generate statements that are untrue. ChatGPT, a popular application of generative AI, often meets criticism for exactly this.

The question is whether we expect generative AI to act as an omniscient knowledge database that has command of every fact in the world. Or is this perhaps the wrong way to approach these large language models?

An alternative way of thinking is to view models like GPT-3.5 and GPT-4, also known as Large Language Models (LLMs), as small reasoning engines: engines that can solve more flexible and complex problems than traditional programming. Seeing generative AI in this light may expand our understanding of the technology and enable us to think of promising applications of AI technology.