Right and Wrong Things to Work On

It seems like there are two kinds of things you can work on: right and wrong things. The right things bring you closer to your goals, the wrong things take you further away from your goals or nowhere at all. I firmly believe that one's ability to choose the right things to work on largely determines one's success.

Right and wrong things to work on must always be viewed relative to a goal. One thing may be the right thing to work on to achieve goal A, but the wrong one if you want to achieve goal B.

As a person, I am very enthusiastic and full of ideas. These may seem like two positive attributes, but the combination often leads to a lack of focus. For there are many good ideas, but not all of them lead towards the same goal. If you are too easily seduced by your abundance of ideas, you end up unfocused, and an unfocused effort moves you towards your goal only very inefficiently.

In my opinion, that's why it's important to spend time figuring out what your goals are. It sounds simple, but personally, I find it much easier said than done! For what is the actual dream scenario if you think 5-10 years into the future? This question deserves a real effort to answer, and it's worth revisiting regularly to check whether your actions align with your long-term goals.

The clearer you are on your goals, the better you become at telling the right things to work on from the wrong ones.

A Process for Building LLM Classifiers

Large language models (LLMs) can be prompt-engineered to solve a wide variety of tasks. While many consider chat the primary use case, LLMs can also be used to build traditional classifiers.

Before the rise of advanced generative text-to-text models, crafting a custom text classifier was a time-consuming process that required extensive data collection and annotation.

Nowadays, you can get your hands dirty with LLMs without worrying about annotating data. This is great as it saves you a lot of time. However, it also becomes tempting to bypass best practices for building robust machine learning applications. 

When there's no need to create a training dataset, the temptation to simply hand-tune a prompt on a few examples becomes strong. You might convince yourself that it will generalize to any data it might be presented with. The challenge is that without annotations to measure accuracy, or a method to assess your prompt, you can't determine how robust it will be once deployed.

In my recent work with LLMs, I have thought a lot about this and have developed a process that, in my experience, enables the construction of robust LLM classifiers. This method is not only more efficient but also more enjoyable to tune than the old-school way of doing it.

The following process will help you craft more robust and reliable LLM modules.

Step 1: Collect Dataset

Collect a raw, unannotated dataset representative of the data your classifier will encounter in real-world use. The dataset should be large enough to give the desired significance level when assessing the classifier, while remaining realistic for you to annotate and not exhausting your OpenAI API budget. Divide the dataset into validation and test subsets.
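To make this concrete, here is a minimal sketch in Python. The example texts, the 50/50 split and the file names are placeholder choices, not part of the process itself:

import json
import random

# Raw, unannotated texts representative of real-world usage.
texts = [
    "I absolutely loved this product!",
    "The delivery took three weeks.",
    # ... the rest of your raw dataset
]

random.seed(42)  # fix the seed so the split is reproducible
random.shuffle(texts)

# Split into validation (for tuning) and test (for the final check).
midpoint = len(texts) // 2
with open("validation.json", "w") as f:
    json.dump(texts[:midpoint], f)
with open("test.json", "w") as f:
    json.dump(texts[midpoint:], f)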

Step 2: Create Initial Prompt

Construct an initial prompt you believe will be effective. It should yield two outputs. The first output should articulate the model's reasoning as it determines which class to assign to the input text.

This will be useful for iteratively improving the prompt, ensuring it aligns with the task. In line with the chain-of-thought method, it should also improve performance and enhance explainability. The second output should be the class the LLM assigns to the input.

The output format should look something like this:

{ "thoughts": <rationale behind classification here>, "class": <the class the model has classified the example as here> }

Test the prompt on a few dataset samples to get a feeling of the model's comprehension of the task. Dedicate some time to refining it for optimal results. You should be confident that the LLM has a reasonable understanding of the task.
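As an illustration, a prompt and a call along these lines could serve as a starting point. This is a sketch, not the exact prompt: the sentiment classes, the gpt-4 model choice and the prompt wording are placeholders, and it assumes the openai Python client (v1+) with an API key in the environment:

import json
from openai import OpenAI

client = OpenAI()

# Placeholder prompt: asks the model to reason first, then emit the
# two outputs in the JSON format described above.
PROMPT = """You are a text classifier. Classify the user's text as
"positive", "negative" or "neutral". First reason about which class
fits best, then answer with JSON only, in this format:
{"thoughts": "<rationale behind classification>", "class": "<class>"}"""

def classify(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model choice
        temperature=0,  # deterministic output, aids reproducibility
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": text},
        ],
    )
    # The prompt demands pure JSON, so parse the raw completion.
    return json.loads(response.choices[0].message.content)

print(classify("I absolutely loved this product!"))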

Step 3: Run, Inspect and Annotate

Now run the hand-tuned prompt on the entire validation dataset. For reproducibility, set the temperature to 0. Review all classified samples: if the LLM's categorization is inaccurate, correct it and document the areas of misunderstanding. Use the thoughts output to understand its decision-making process.

During annotation, you'll almost certainly discover complexities and nuances in the problem you're trying to solve that you didn't initially think of. You will also likely discover ambiguities in the instructions you gave the LLM, where you will have to be clearer about what you want it to do. In some cases, the limits of the LLM's understanding will also reveal themselves. Document these findings in an "annotation protocol" that outlines rules for handling edge cases.
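A sketch of this step, reusing the hypothetical classify() helper from the step 2 sketch: predictions and thoughts are written to a JSONL file with an empty "gold" field, to be filled in by hand during annotation (the field and file names are placeholders):

import json

with open("validation.json") as f:
    validation = json.load(f)

with open("validation_predictions.jsonl", "w") as f:
    for text in validation:
        result = classify(text)  # {"thoughts": ..., "class": ...}
        f.write(json.dumps({
            "text": text,
            "thoughts": result["thoughts"],
            "predicted": result["class"],
            "gold": "",  # fill in the correct class by hand
        }) + "\n")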

Step 4: Measure Performance of Prompt

Upon completing step 3, you'll have an annotated validation dataset, which allows you to measure the prompt's predictive performance.
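A sketch of the measurement, reading the annotated JSONL file from step 3. Using scikit-learn is my assumption here; any metrics library will do:

import json
from sklearn.metrics import accuracy_score, classification_report

gold, predicted = [], []
with open("validation_predictions.jsonl") as f:
    for line in f:
        row = json.loads(line)
        gold.append(row["gold"])
        predicted.append(row["predicted"])

# Overall accuracy plus per-class precision, recall and F1.
print("accuracy:", accuracy_score(gold, predicted))
print(classification_report(gold, predicted))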

Step 5: Adjust Prompt

After step 4, you'll have written notes detailing cases where the LLM misclassified data. From these, formulate a hypothesis about which prompt modifications could enhance its accuracy, and adjust the prompt in a way you think will mitigate the errors.

Step 6: Iterate

After adjusting the prompt in step 5, run it on the validation dataset again and measure its performance. Ideally, results should improve post-adjustment. Analyze incorrect classifications and take notes to understand the model's behavior. Repeat this process until you are satisfied with the prompt's performance, or you believe you have reached maximum performance.

Step 7: Measure Performance on Test Dataset

Now is the time to follow best practices, like the diligent and competent ML engineer you are: run the tuned prompt on the test set. Your test set isn't annotated yet, which makes it very tempting to skip this step. But you know you have to do it! You will likely find that performance on the test dataset is a little worse than on the validation dataset. This is expected: you have probably overfitted your prompt to the validation dataset.

Conclusion

Congratulations, you now have an LLM classifier to solve a problem for you! For now, this is the best process I have. If you know of a better approach, I would love to hear from you. Additionally, at SEO.ai, where I work as an ML Engineer, we are constantly striving to crystallize our learnings into code. Specifically, we are developing a Python package called prompt-functions, which, in our experience, makes this process much smoother. We would love to continue the conversation on how to manage LLM applications, so please feel free to open an issue, send us a pull request, or simply reach out to me 🤗





The Human-to-Screen Ratio

I feel like I spend way too much time in front of a screen coding, compared to the time I spend interacting with people. When I talk to other people about this, some say the exact opposite. Perhaps it would be useful to define a metric for this. This way, people can measure it and strive to achieve their desired balance between screen and human interaction. We could call it the human-to-screen (H2S) metric.

Let:

  • H: time spent interacting with humans
  • S: time spent in front of a screen

The ratio can then be defined as:

H2S = H / S

Where:

  • If H2S > 1, more time is spent interacting with humans than in front of a screen.
  • If H2S = 1, equal time is spent on both activities.
  • If H2S < 1, more time is spent in front of a screen than interacting with humans.
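For example (with made-up numbers): a day with 2 hours of human interaction and 8 hours of screen time gives H2S = 2 / 8 = 0.25.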

In my daily work, I estimate my H2S ratio to be roughly 0.25. Ideally, I'd prefer it to be around 2, or maybe even higher. Interestingly, when I discuss this with others, it seems the optimal H2S ratio varies among individuals.

What's your ideal human-to-screen ratio? Is it optimal for you, or do you wish for a higher or lower ratio?

Can ChatGPT Create New Knowledge? 👨‍🔬

The other day at seo.ai, we had a discussion about whether LLMs can create new knowledge 🤔

Every time you run into a question of the type:

"Can AI X?"

I often find it useful to ask the question:

"Can humans X?"

So in this case:

"Can humans create new knowledge?"

Savor those two questions side by side:

1. Can AI create new knowledge?

2. Can humans create new knowledge?

I don't know the answer to number 1, but I'm actually not sure about the answer to number 2 either 😅

When you think about it, the question is perhaps a bit more nuanced than it first appears. For isn't new knowledge something you gain from experiencing the real world?

And isn't it then unfair to LLMs that humans have far greater access to the real world 🤔




Far From Everyone Uses ChatGPT

I seriously use ChatGPT constantly when I work! I use it to:

  • write code
  • visualize data
  • explain code that others have written
  • test code
  • discuss best practices and architecture
  • learn to code
  • and much, much more

A rough estimate is that ChatGPT increases my productivity by 35% on average. For some tasks, ChatGPT increases my productivity by several 1000%. There are even things I take on because I have ChatGPT that I otherwise wouldn't. On top of that, I also write better code overall, because I have a sparring partner who is always online.

When I talk to my colleagues in the tech industry, far from everyone uses ChatGPT. Considering how much value I get out of ChatGPT, I simply can't fathom that not all developers are using it heavily!

From my point of view, there is still a huge untapped productivity potential lurking just around the corner!