A Crash Course in Error Handling for Streaming Data Pipelines
Learn how to handle errors in streaming data pipelines using concepts such as dead-letter queues.
A Fresh Start? The Path Toward Apache Solr's v2 API
Modernization efforts face particular hurdles in large, established OSS projects. Come learn about the community and technical challenges encountered on Apache Solr's path towards revamped HTTP APIs.
A Kafka Client’s Request: There and Back Again
Understand how data moves into and out of Apache Kafka® by taking a look at the producer and consumer request life cycle. Follow a request from an initial call to send() or poll(), all the way to disk
Advanced Search Plays with GraphQL
This demo-heavy workshop scores a hat trick by combining Apache Lucene, MongoDB, and GraphQL to easily build search functionality across data collections and 3rd party APIs into applications.
Alexa, is The Smart Home vision failing?
Amazon's Alexa team has lost billions. Google's and Apple's hubs aren't great successes. Is the smart home failing? How can you keep your lights on when they depend on cloud infrastructure to work?
Apache Airflow in Production - Bad vs Best Practices
This talk will explore the bad and best practices when deploying Apache Airflow in a production environment. From common pitfalls such as misconfigured tasks and lack of scalability,
Avoiding Anti-patterns in Technical Communication
Communicating technical knowledge effectively is a core skill for practitioners, but one which is often neglected. We’ll give practical advice on how to (and not to!) communicate technical ideas.
Barcamp
Barcamps are informal sessions, a kind of "un-conference", with a schedule decided on the day.
Big data in the service of reliable news
Data vs. fake news: using available data to offer a critical view of the world
Boosting Ranking Performance with Minimal Supervision
Using generative Large Language Models (LLMs) to generate synthetic labeled data to train in-domain ranking models. Distilling the knowledge and power of generative LLMs into effective ranking models.
Building MLOps Infrastructure at Japan's Largest C2C E-Commerce Site
The MLOps infrastructure we built to support ML in search at Mercari, Japan’s largest C2C e-commerce platform.
Building On-Ramps for Non-Code Contributors in Open Source
Open source software is so much more than code – docs, community and infra need maintaining. How do you attract and keep non-code contributors? Let two experienced practitioners show you the way!
Building Real-Time Applications: Cyclist Crash Detection
In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes.
Catch the fraud — with observability and analytics
This is the story of how to catch cheaters by combining observability and analytics data through the power of search.
ChatGPT is lying, how can we fix it?
Large Language Models are great at grammar but tend to confabulate. Building a reliable knowledge base might be a way to solve this. Here is how.
ClickHouse: what is behind the fastest columnar database
Columnar databases seem to be full of mysteries and confusion. In this introduction to ClickHouse, we'll take apart its building blocks to see how it achieves its remarkable performance.
Column-level lineage is coming to the rescue
How are the columns containing sensitive data used across the data ecosystem? What input columns were used to produce a given report field? OpenLineage can answer those questions automatically.
Connect GPT with your data: Retrieval-augmented Generation
Learn how to build with LLMs, like ChatGPT, and avoid typical pitfalls like hallucination and outdated information. Accompanied by practical code examples using the open source framework Haystack.
Cooking up a new search system: Recipe search at Cookpad
How we successfully transitioned the search system for the world's largest recipe sharing platform to a modern stack – including the wins, fails, team structures, and processes along the way.
Creating chaos in containers
Chaos engineering is hard; in containers it is even harder. This session will show attendees the considerations and get them started on their way to making more resilient applications in the cloud.
Cross Data Center Replication in Solr - A new approach
Learn about the motivation that led to the development of the new Cross Data Center (XDC) Replication module in Apache Solr and discover the capabilities it offers to make Solr disaster-ready.
Data Ops on Kubernetes with Kubernetes Operators
The rise of cloud-native technologies has revolutionised the way organisations store, process, and manage their data. We'll explore the power that Kubernetes gives us to manage it efficiently.
Declarative Data Collections for Portable Parallelism
This talk introduces a novel programming model: the user declares data collections with their properties, and these declarations can be transparently ported to multiple platforms, including GPUs.
Deep dive into an Elasticsearch plugin for query-time joins
Siren Federate is an Elasticsearch plugin for joining inverted indices at query-time. Learn in this talk about its inner workings and how it complements features of Elasticsearch like runtime fields.
Fact Checking Rocks: how to build a fact-checking system
In this infodemic era, fact-checking is becoming a vital task. In this talk, we’ll discover how to build a simple fact-checking system for rock music, leveraging the power of open-source libraries.
From keyword to vector
During this talk, I will take you on my over-a-decade-long journey in search. From witnessing the inception of Elasticsearch to my current endeavors with Weaviate, I will share my first-hand experience of the evolution, challenges, and lessons learned along the way.
Hadoop Vectored IO: your data just got faster!
We are introducing a new Hadoop Filesystem API called "vectored read", with which we can achieve significant speedups for all big data applications, especially on cloud storage like S3 and ABFS.
Highly Available Search at Shopify
This talk shares the story of how Shopify implemented seamless storage autoscaling for Elasticsearch that powers search for millions of merchants without data loss.
How to Implement Online Search Quality Evaluation with Kibana
Conducting online testing is crucial for assessing a model’s performance in a real-world scenario. This talk explores a customized approach for evaluating ranking models using Kibana.
How to not kill people
As AI grows, software manages more risks to humans. Moving fast and breaking things won't do. We will look at aviation to learn how successful risk management structures might look in software & AI.
How to train your general purpose document retriever model
A practical guide for training learned sparse models to outperform BM25 on zero-shot document retrieval tasks
Ingesting over 4 million rows a second on a single instance
When we set out to write a fast, open source time series database, we realised we would need every trick in the book to make it as performant as possible. This talk will show what's inside.
Introducing Multi-valued Vector Fields in Apache Lucene
Supporting multiple vectors in a field dedicated to K-nearest-neighbors search has long been a fundamental problem for Apache Lucene. This talk describes how it has finally been designed and implemented.
Joining Dozens of Data Streams in Distributed Stream Processing Systems
This talk will explore the techniques and best practices for joining dozens of data streams, focusing on different joining mechanisms, such as binary joins and delta joins, as well as pros and cons.
Kaldb: serverless lucene at petabyte scale
In this talk, we share our experiences, best practices, and lessons learned in designing and operating a serverless Lucene serving system at PB scale.
Laptop-sized ML for Text, with Open Source
Advanced ML models for text may need hundreds of machines, but with open source tools and pre-trained models, you can do a lot just on your laptop or docker container. Discover what and how!
Learning to hybrid search
Combining BM25, neural embeddings and customer behavior with Learning-to-Rank into an ultimate ranking ensemble, with examples on the Amazon ESCI e-commerce search dataset.
Migrate Data, <Mesh> in mind
For quite some time, Hadoop served as the data warehouse for Kleinanzeigen. In this presentation, our objective is to provide an overview of our approach, which involves implementing a cloud-based data pipeline with the help of dbt and Airflow.
ML with Domain-Specific Ontology for IT Security Industry
The BSI provides actual data on acute IT threat situations. We developed a system for detecting threats: crawling, automatic analysis with NER, NEL, provision and use of dedicated tools for evaluating
Model Fine-tuning For Search: From Algorithms to Infra
Deep learning for search has become a hot topic, but pre-trained neural nets often do not perform as well as expected. We will discuss the algorithms behind model fine-tuning, and how to scale it up.
No Mean Feat: Upgrading a Customized Solr to Upstream Solr
Learn how the News Search Infrastructure Team at Bloomberg migrated from a customized implementation of Apache Solr to the upstream Apache Solr
Platform Engineering is All About Product
“Platform Engineering,” the latest buzzword, means building an internal platform to improve your SDLC in a way your developers will want to use. Can this be done with engineering skills alone?
Privacy-Preserving Web Search
An ethical overview of how a privacy-focused search engine has to adapt its behavior from crawling to ranking web documents without knowing anything about the user and still be as relevant as possible
Rethinking Autoscaling for Apache Solr using Kubernetes
Apache Solr’s built-in autoscaling is gone, but the need for autoscaling persists. Using Kubernetes’ HPA, the Solr Operator and new Solr APIs, we re-introduce autoscaling for Solr on Kubernetes.
Scalable distributed messaging & streaming with Apache Pulsar
In this session, you'll discover seven Apache Pulsar features that enable you to build amazing event-driven applications and how Apache Pulsar differs from traditional message brokers.
Search saves lives: solving healthcare problems with search
During COVID, the pressure was on for search. I’ll discuss the challenges of building a search engine matching people to COVID test facilities and how the lessons learned can solve healthcare issues.
Searching large data sets in (near) constant time
Tackle large search results by estimating hit count, interpolating a first phase ranking and limiting the returned result set to the most relevant documents in a multi-million document index.
Semantic vs keyword search as context for GPT
If you want to build a chat bot like ChatGPT on your own data, you need to use search to provide the context. Usually semantic search is used, but we've found that keyword search has some pros.
Supercharging your transformers with synthetic query generation and lexical search
This talk will explore dramatic gains in ranking performance from small transformer models, fine-tuned with synthetic query generation and combined with lexical search, and will equip the audience to pursue the same approach using open-source tools.
Synthetic data: when, why, and how
This talk will cover several use cases in which generating synthetic data is useful (or even essential) and introduce a toolbox of practical techniques for synthesizing data in these situations.
The Debate Returns (with more vectors) Which Search Engine?
It's that old question - which search engine should I choose for my project? Elasticsearch, Solr, OpenSearch (all based on Lucene), or Vespa, or maybe one of the new vector search engines?
Tiny Flink — Minimizing the memory footprint of Apache Flink
We will explore options to run Apache Flink with a very low resource footprint, allowing users to run full streaming SQL queries or custom streaming applications on JVMs with less than 500 MB
Tip of the Iceberg
Apache Iceberg is an open table format that has wide support among open-source and cloud vendors. After this talk, you'll be comfortable with all the concepts and how to use Iceberg.
Towards a decentralized and collaborative search engine
In this session we will share our vision towards an alternative, decentralized and collaborative search engine, from social considerations to technical implementation.
Using Dense Vector search at the EU Publications Office
How dense vector functionality was used to provide several ‘Google-like’ capabilities such as Extractive Answers and knowledge graph search over a large dataset at the EU Publications Office.
Using TensorFlow in a Solr Query Parser
Tutorial for writing a Solr query parser that uses TensorFlow for Java to augment queries.
Vectorize Your Open Source Search Engine
Fascinated by vector search but don't know where to start? Join us to crack the code and leverage the potential of vector search to delight your users.
What defines the “open” in “open AI”?
This talk focuses on unpacking this year’s big buzzwords of “open AI” and “responsible AI” to highlight the range of (sometimes contradictory) activities that exist under these umbrella terms and how
What's coming next with Apache Lucene?
This talk will discuss the directions Apache Lucene might take in the coming years. From the perspective of a full-text search engine, it looks feature-complete. So what comes next?
When ms matter: Maximizing query performance in CrateDB
Achieving optimal execution plans in distributed databases is a challenging task. This talk will focus on CrateDB, a distributed SQL database, and key strategies for optimizing its query performance.
When Probably is Good Enough
Probabilistic data structures give developers room to massively cut down on space requirements while sacrificing a bit of accuracy, so when is probably good enough?
Who broke the build? Using Kuttl to test and release faster
No one wants to be responsible for breaking the build. But what can you do as a developer to avoid being the bad guy? How can project leads enable their teams to reduce the occurrence of broken builds?