The chatbots are coming

What happened at the FORCE LLM hackathon in Stavanger
Author

Matt Hall

Published

December 2, 2023

🤖 This week, seven teams of scientists and data scientists collaborated to explore ideas in large language modeling applied to a large new open dataset. Here’s what happened.

The FORCE consortium, which has hosted hackathons and data science contests before (read this and that), held another groundbreaking event this week: the NPD language modeling hackathon. The event took place at the NPD in Stavanger, Norway, from 29 November to 1 December 2023.

As in past years, the event was organized by a small team coordinated by Peter Bormann (ConocoPhillips), who not only believes passionately in the importance of open collaboration but is committed to acting on that belief 🙌 Scroll down for the rest of the organizational credits.

One major feature of this event was the large new dataset the team has assembled. This contains almost three million pages of text from various reports published by the Norwegian Petroleum Directorate, Netherlands Oil and Gas and the UK North Sea Transition Authority. It will soon be published under the NLOD 2.0 licence, and should be an exciting resource for the community; we just want to make sure we have taken reasonable steps to protect people’s privacy before publishing it.

Projects

Here’s a very quick rundown of the teams that formed at what I believe was the first public LLM-based hackathon in Norway or in the energy sector (AkerBP ran one of their own a few weeks ago):

  • Anonymizers — Masking personally identifiable information in public datasets.
  • Embedding enthusiasts — Fine-tuning an embeddings model, using cleaner data.
  • Zero-shot chatbots — What kind of questions can chatbots answer about the dataset?
  • Knowledge-graphers — Extracting a knowledge graph and providing it to chatbots.
  • Q&A generators — Generating question-answer pairs for fine-tuning Q&A chatbots.
  • Metadata extractors — Automatically pulling metadata from the dataset.

Tomorrow I will put up another post describing the projects in more detail. When it’s up, you can click here to read it!

Credits

It takes a community of organizers to pull off a community event like this. Here’s a probably incomplete list; apologies if I missed anyone (drop me a line!):

  • Jesse Lord (Kadme and Fabriq) for the dataset, which will soon be released under an open license.
  • Lukas Mosser (AkerBP) for the starter notebook and know-how.
  • Paul Cleverley (Infoscience) for the named entity tags.
  • Eirik Haughom and Frode Odinsen (Microsoft) for the in-event Azure support.
  • The NPD, especially Janke Ro and those involved in FORCE.
  • It was my privilege to facilitate the proceedings, a job I am ill-suited for but enjoy anyway 😅 Thank you to Peter for the opportunity!