This report was written by Anna Geller. Find out about her personal highlights and key takeaways from day 1 at Berlin Buzzwords 2023, starting with the keynote:
The conference began with a keynote by Jennifer Ding, a Research Application Manager at The Alan Turing Institute. She emphasized that most AI production is presently limited to a few companies in a few countries, creating a need for alternative pathways that allow more people to contribute to building, applying, and governing LLMs.
Not everyone has the same vision of what the future of AI should look like. This multiplicity of visions is both a beauty and a danger. The speaker therefore raised the question: who should define the future of AI? Should it be just a handful of large organizations striving for market dominance?
The talk highlighted multiple facets of Open Artificial Intelligence (open AI), including being open for:
- faster innovation — when multiple people work on the same problems, they help the field progress faster,
- community building — HuggingFace being called the GitHub of AI,
- empowering those outside of the tech bubble,
- creating new forms of AI and AI production that can benefit more people,
- feedback and auditability,
- diverse participation, including global partnerships (given the international impact of AI).
The keynote concluded with an engaging discussion about building AI production pipelines that live up to the aspirational value of “openness.” The audience expressed the need for an open process and accessibility to weights and data. Finally, the talk highlighted the importance of cutting through the hype and focusing on addressing problems responsibly.
Selected Talks and Sessions
The conference featured a wide range of talks covering data infrastructure, search, stream processing, and, naturally, AI. There were always four parallel talks, so I had to make difficult choices about which one to pick. Below, you can find a brief summary of the talks I was able to attend.
Laptop-sized ML for Text, with Open Source
Nick Burch, a Director of Engineering at FLEC, discussed how to use open-source tools and pre-trained models to work with LLMs on your laptop. The speaker began with an introduction to embeddings, language models, and large language models (LLMs).
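To make the embeddings idea concrete (my own toy illustration, not the speaker's code): an embedding model maps each text to a vector, and semantic similarity becomes vector similarity, typically measured with cosine similarity. The four-dimensional vectors below are made up for readability; real models use hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings; real models use hundreds of dims.
king = [0.9, 0.1, 0.7, 0.3]
queen = [0.8, 0.2, 0.75, 0.35]
banana = [0.1, 0.9, 0.2, 0.8]

# Semantically related texts end up closer together in the vector space.
assert cosine_similarity(king, queen) > cosine_similarity(king, banana)
```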
The session then gave an overview of a wide range of open LLMs (with various degrees of openness):
- LLaMA — the code of LLaMA is open, but the model is not
- OpenLLaMA — a permissively licensed open-source reproduction of Meta AI’s LLaMA 7B trained on over a trillion tokens using the RedPajama dataset; actively in progress (last model released on 15th of June 2023)
- Stanford Alpaca — a fine-tuned version of LLaMA 7B aimed at instruction following
- Databricks Dolly and then Databricks Dolly 2.0
- MosaicML MPT-7B — a family of pretrained transformers that can be used as-is or further refined with your own data/instructions
- BigScience Bloom — released under the BigScience Responsible AI License, more open than LLaMA but less open than the MPT-7B family
- Guanaco and QLoRA — Quantized Low-Rank Adapters method allowing for fine-tuning LLMs rapidly on a single GPU.
Since the cost of fine-tuning has come down, you can do a lot just on your laptop.
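The core trick behind QLoRA can be sketched in a few lines (my own toy model, not the speaker's code; real fine-tuning would use libraries such as PEFT and bitsandbytes). The base weights are stored quantized, e.g. to 4 bits, and stay frozen; only a small low-rank adapter `A @ B` is trained, which is what makes single-GPU (or laptop) fine-tuning feasible.

```python
def quantize(w, bits=4):
    """Uniform symmetric quantization: store small ints plus a scale (toy version)."""
    levels = 2 ** (bits - 1) - 1          # 7 levels per side for 4-bit
    scale = max(abs(x) for row in w for x in row) / levels
    q = [[round(x / scale) for x in row] for row in w]
    return q, scale

def dequantize(q, scale):
    return [[x * scale for x in row] for row in q]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen base weights, stored in 4-bit form.
W = [[0.62, -0.31], [0.05, 0.88]]
q, scale = quantize(W)

# Trainable low-rank adapter (rank 1): only A and B would receive gradients.
A = [[0.1], [0.2]]          # shape (2, 1)
B = [[0.3, -0.4]]           # shape (1, 2)

# Effective weight = dequantized frozen base + A @ B
W_eff = add(dequantize(q, scale), matmul(A, B))
```

A rank-1 adapter for a 2x2 matrix saves nothing, but for a 4096x4096 layer a rank-16 adapter trains ~0.8% of the parameters, which is where the savings come from.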
The talk was also a great reminder that it’s important to understand the fundamentals and try the simplest solutions first. While LLMs on OpenAI’s or Google’s scale are applicable to more tasks, they involve cost and sharing your data, and many open-source models are often good enough (if not better) for many real-world problems. However, the speaker warned the audience about how early we still are in steering LLMs’ behavior, interpreting their inner workings, and making their results more predictable and less prone to issues such as prompt injection.
Migrate Data, Mesh in mind
Aydan Rende, Senior Data Engineer at Kleinanzeigen, discussed how they migrated from Hadoop to cloud-based data pipelines using dbt and Airflow. Even though they faced some challenges along the way, including difficulties debugging legacy data flows and dealing with domain ownership issues, the migration helped them adopt Data Mesh and save costs.
How to train your general-purpose document retriever model
In this talk, Tom Veasey and Quentin Herreros from Elastic discussed the challenges of training a state-of-the-art general-purpose model for document retrieval and focused on the learned sparse model (LSM) architecture. The talk covered LSMs and challenges in training language models effectively.
The session provided an overview of the key ingredients of the full training pipeline and useful lessons learned along the way. This was a very technical talk. Here are some notes:
- For their use case, the results using ELSER were competitive with OpenAI on Information Retrieval tasks while using a much smaller model.
- They used a cross-encoder distillation approach to make the model robust to the problem of a partially labeled dataset. New LLMs might be good candidates for distillation to generate candidate queries and train a third teacher model for fine-tuning.
- How you structure data within batches matters. The speakers evaluated various approaches.
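To sketch what "learned sparse" retrieval means (my own toy example, not Elastic's implementation): the model expands each text into a sparse set of weighted terms, possibly including terms that never literally appear in it, and scoring reduces to a sparse dot product over shared terms, much like an inverted index with learned weights. The term weights below are invented for illustration.

```python
def sparse_dot(query_weights, doc_weights):
    """Score = dot product over the (few) terms both sparse vectors share."""
    shared = query_weights.keys() & doc_weights.keys()
    return sum(query_weights[t] * doc_weights[t] for t in shared)

# Hypothetical model outputs: each text expands to weighted terms,
# including terms that never literally appear in it ("auto" for "car").
query = {"car": 1.2, "auto": 0.7, "price": 0.9}
docs = {
    "d1": {"auto": 1.1, "price": 1.0, "dealer": 0.5},
    "d2": {"banana": 1.3, "price": 0.2},
}
# Rank documents by their sparse-dot-product score against the query.
ranked = sorted(docs, key=lambda d: sparse_dot(query, docs[d]), reverse=True)
```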
Apache Airflow in Production — Bad vs. Best Practices
In this talk, Bhavani Ravi, an independent consultant, covered common pitfalls when using Apache Airflow in production and potential options to mitigate those issues. Here are some of the key points from the session:
- Beware of Copilot, as it will generate legacy code that won’t work in the current Airflow version.
- Never use the latest release, as it may have regression issues. Pick 2–3 versions earlier unless you need a specific newly released feature.
- Use Postgres as a metadata database (even though other options such as MySQL are available).
- Use Kubernetes Executor for most workloads. Use Celery if you are willing to manage or scale workers 24/7.
- Optimize your DAGs to be runnable locally.
- Use Infrastructure as Code.
- Store the logs of your components (such as the scheduler or web server) and make them searchable — otherwise, the scheduler can go down and you won’t know why your DAGs never started.
- Lock your requirements.txt and package each script task into Docker to avoid dependency issues.
- Use Community Slack or Stack Overflow if you have questions.
Tip of the Iceberg
Fokko Driesprong, a Senior Open Source Software Engineer at Tabular, talked about Apache Iceberg. This open table format brings the reliability and simplicity of SQL tables to files in a data lake. Iceberg is an abstraction layer on top of files in your data lake. It makes them vendor-agnostic and allows you to process the same data (e.g., Parquet files) using various query engines, including performing ACID operations.
The speaker introduced Iceberg and its history and explained key concepts, including metadata files, manifest lists, and manifests, as well as schema, partition, and sort-order evolution. The session ended with a live demo using PyIceberg and a lively discussion with the audience.
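The snapshot concept at the heart of Iceberg can be modeled in a few lines (this is my own toy model of the idea, not PyIceberg): every commit produces a new immutable snapshot pointing to the full set of data files, so readers can scan the latest state or time-travel to an older snapshot.

```python
class ToyIcebergTable:
    """Toy model of Iceberg metadata: each commit creates a new immutable
    snapshot listing the table's data files, enabling time travel."""

    def __init__(self):
        self.snapshots = []          # list of (snapshot_id, tuple_of_files)

    def append(self, new_files):
        current = self.snapshots[-1][1] if self.snapshots else ()
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append((snapshot_id, current + tuple(new_files)))
        return snapshot_id

    def scan(self, snapshot_id=None):
        if not self.snapshots:
            return ()
        if snapshot_id is None:      # read the latest snapshot by default
            return self.snapshots[-1][1]
        return dict(self.snapshots)[snapshot_id]

table = ToyIcebergTable()
s1 = table.append(["a.parquet"])     # commit 1
s2 = table.append(["b.parquet"])     # commit 2

table.scan()       # latest: both files
table.scan(s1)     # time travel: only the first file
```

The real format adds a layer of manifest lists and manifest files between a snapshot and its data files, so commits don't have to rewrite the whole file listing.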
Joining Dozens of Data Streams in Distributed Stream Processing Systems
Yingjun Wu, the founder of RisingWave Labs, explained best practices for joining data streams at scale, including joining mechanisms (such as binary joins and delta joins) and their pros and cons. As motivation for the talk: Change Data Capture is often a good approach, but to get real-time insights (e.g., what happened a minute ago), you usually need to combine data from multiple sources in real time, and current stream processing systems often break down when you try to join multiple streams.
The speaker shared lessons learned about optimizing the performance of distributed systems and leveraging decoupled compute-storage architecture to reduce join costs and recover from failure.
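A binary (symmetric hash) join can be sketched as follows (my own toy illustration, not RisingWave code): each input keeps per-key state, and every arriving row probes the other side's state to emit matches. The state on both sides grows without bound, which hints at why chaining many such joins gets expensive.

```python
from collections import defaultdict

class StreamingHashJoin:
    """Toy symmetric hash join: each side buffers its rows by key and
    probes the other side's buffer whenever a new row arrives."""

    def __init__(self):
        self.state = {"left": defaultdict(list), "right": defaultdict(list)}

    def on_row(self, side, key, row):
        other = "right" if side == "left" else "left"
        self.state[side][key].append(row)
        # Emit one joined result per matching row buffered on the other side.
        if side == "left":
            return [(row, r) for r in self.state[other][key]]
        return [(r, row) for r in self.state[other][key]]

# Hypothetical clickstream/purchase events joined on user id.
join = StreamingHashJoin()
out = []
out += join.on_row("left", "user1", {"click": "ad1"})      # no match yet
out += join.on_row("right", "user1", {"purchase": 30})     # matches the click
out += join.on_row("right", "user2", {"purchase": 5})      # no match
```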
Ingesting over 4 million rows a second on a single instance
In this session, Javier Ramirez, a Developer Advocate at QuestDB, discussed the technical decisions and trade-offs made when building QuestDB, an open-source time-series database developed mainly in Java. He presented some of the changes they have made over the past two years to handle late and out-of-order data, non-blocking writes, read replicas, and faster batch ingestion.
Panel: Which search engine should you choose?
Moderator Charlie Hull from OpenSource Connections discussed the pros and cons of various search engines with panelists representing Elasticsearch, Solr, Vespa, Qdrant, and Weaviate. The debate was very balanced. Among others, the participants were asked to explain:
- what each tool is good at,
- what each tool is not good at, but people use it for that anyway,
- typical use cases,
- how they support AI,
- their approach to community building,
- their future roadmap.
Below is a summary of comments from each tool provider:
Vespa: It’s best for rankings, recommendation systems, and vector searches in Information Retrieval. It’s not good at log analytics, but it’s well-positioned for AI use cases, especially given its GPU-accelerated inference. It was interesting to hear a competitor-challenging comment claiming that Elastic is not open-source but only source-available. But that was, unfortunately, the only hot take from the entire panel. Vespa’s roadmap is aimed at lowering the barrier to getting started.
Solr: Apache Solr is good at both traditional lexical search and vector search, with a focus on scalability; the panelist mentioned the oldest community in the search space as a differentiator. It’s not good at replacing databases; quote: “Don’t use Solr as a database”. The roadmap includes improvements in the areas of REST API, security, and simplification of the UX.
Weaviate: It’s an AI-native vector search engine (stored vectors can be easily provided as context to LLMs to mitigate the hallucination problem), and its highlighted benefits are integration with existing AI models, low latency, and ease of getting started. It’s not good at time-series use cases. They don’t have many contributors, but they measure the success of the community’s health by how many community members answer each other’s questions in their public Slack. Their roadmap includes improvements to the scalability of tenants and compression techniques (to compress the vector space as a dimensionality reduction mechanism).
Elasticsearch: They are good at being a general-purpose search engine. They integrate with LangChain and HuggingFace so that your private data already stored in Elastic can be used directly as context for LLMs. They are good at log analytics and scalability, supporting a 3-digit number of nodes. They are not good at high transaction rates (e.g., use cases where something gets updated five times per second) or as a way to store blobs (even though some users still tried it). The community has always been a scaling function for them. On their roadmap are a serverless offering and backing up indices to S3.
Qdrant: Their main advantage is the simplicity of storing embeddings. It’s not suitable for non-vector data (e.g., JSON documents), but great for storing your custom documents for LLMs. They have a Discord community to answer questions and already many code contributions from the community.
Overall, attending Berlin Buzzwords 2023 has been a fantastic experience so far. The first conference day alone provided a great opportunity to learn about the latest advancements in search and AI while also networking with so many passionate and curious professionals. I’m looking forward to tomorrow’s talks.