October 2, 2018, updated 20 Jul 2022 7:47am

Dropbox Gets New Search Engine

"For most documents, we rely on Apache Tika to transform the original document into a canonical HTML representation"

By CBR Staff Writer

Dropbox has rebuilt and released a new search engine, dubbed “Nautilus”, saying it is significantly faster at indexing new and updated content — and that the company is working on unlocking search for image, video, and audio files.

The new engine uses machine learning to help power personalised searches for Dropbox’s 500 million users across hundreds of billions of documents, something the company’s engineers described as a unique challenge, owing to the need for searches to be highly personalised and to work across rapidly changing sets of documents.

The Dropbox Search Engine Architecture

The system they ultimately built is the first overhaul of the Dropbox search engine since 2015. It uses machine learning to help find files, which required a fundamental rethink of the architecture, including the separation of indexing and serving.

It targets a budget of 500ms for the 95th percentile search (i.e., only 5 percent of searches should ever take longer than 500ms).
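As a rough illustration of what a 95th-percentile budget means in practice, the sketch below checks a batch of latency samples against the 500ms target. The function names, the nearest-rank percentile method and the sample values are assumptions made for this example; the article does not describe how Dropbox measures the figure.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def within_budget(latencies_ms, budget_ms=500, pct=95):
    """True when the pct-th percentile latency does not exceed the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Hypothetical latencies (ms) for 20 searches: one slow outlier is tolerated.
latencies_ms = [100] * 10 + [200] * 5 + [400] * 4 + [900]
p95 = percentile(latencies_ms, 95)
ok = within_budget(latencies_ms)
```

With 20 samples, one search above 500ms still meets the target, while two would not.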

The role of the indexing pipeline is to process file and user activity, extract content and metadata out of it, and create a search index. The serving system then uses this search index to return a set of results in response to user queries.
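The split between indexing and serving can be sketched with a toy inverted index: one method builds the index from document text, another answers queries from it. This is a minimal single-process illustration with made-up names and documents, not a description of how Nautilus stores or shards its index.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy index mapping each token to the set of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_document(self, doc_id, text):
        # Indexing side: extract tokens and record which document they came from.
        for token in text.lower().split():
            self._postings[token].add(doc_id)

    def search(self, query):
        # Serving side: return documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = InvertedIndex()
index.add_document("a", "quarterly sales report")
index.add_document("b", "sales meeting notes")
hits = index.search("sales report")
```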

Together, these systems span several geographically-distributed Dropbox data centers, running tens of thousands of processes on more than a thousand physical hosts.

Engineering lead Diwaker Gupta writes of the new engine: “For most documents, we rely on Apache Tika to transform the original document into a canonical HTML representation, which then gets parsed in order to extract a list of “tokens” (i.e. words) and their “attributes” (i.e. formatting, position, etc…).

“After we extract the tokens, we can augment the data in various ways using a “Doc Understanding” pipeline, which is well suited for experimenting with extraction of optional metadata and signals. As input it takes the data extracted from the document itself and outputs a set of additional data which we call ‘annotations.’ Pluggable modules called “annotators” are in charge of generating the annotations.”
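The flow Gupta describes (HTML in, tokens with attributes out, then pluggable annotators adding extra data) might look something like the sketch below. It assumes the document has already been converted to HTML, uses Python's standard `html.parser` in place of whatever parser Dropbox uses, and invents two simple annotators for illustration; none of these names come from Dropbox.

```python
from html.parser import HTMLParser


class TokenExtractor(HTMLParser):
    """Collects words with their position and whether they sit inside a heading or bold tag."""

    EMPHASIS_TAGS = {"h1", "h2", "h3", "b", "strong"}

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._open_emphasis = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.EMPHASIS_TAGS:
            self._open_emphasis += 1

    def handle_endtag(self, tag):
        if tag in self.EMPHASIS_TAGS and self._open_emphasis > 0:
            self._open_emphasis -= 1

    def handle_data(self, data):
        for word in data.split():
            self.tokens.append({
                "text": word.lower(),
                "position": len(self.tokens),
                "emphasized": self._open_emphasis > 0,
            })


def extract_tokens(html):
    parser = TokenExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Hypothetical annotators: each takes the tokens and returns extra annotations.
def word_count_annotator(tokens):
    return {"word_count": len(tokens)}


def emphasized_terms_annotator(tokens):
    return {"emphasized_terms": [t["text"] for t in tokens if t["emphasized"]]}


def run_annotators(tokens, annotators):
    """Merge the output of every annotator into one annotations dict."""
    annotations = {}
    for annotate in annotators:
        annotations.update(annotate(tokens))
    return annotations


tokens = extract_tokens("<h1>Budget Plan</h1><p>Draft for review</p>")
annotations = run_annotators(tokens, [word_count_annotator, emphasized_terms_annotator])
```

Because each annotator is just a function over the extracted tokens, new ones can be added or removed without touching the extraction step, which is the property the quote highlights.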

The ranking engine, meanwhile, is powered by a machine learning model that outputs a score for each document based on a variety of signals.

Some signals measure the relevance of the document to the query (e.g., BM25), while others measure the relevance of the document to the user at the current moment in time (e.g., who the user has been interacting with, or what types of files the user has been working on). The model is trained on anonymised “click” data from the Dropbox front-end, which excludes any personally identifiable data.
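BM25, the query-relevance signal named above, is a standard text-ranking formula that rewards documents where query terms appear often, adjusted for document length and for how rare each term is. The sketch below is a minimal version using a common IDF variant and the conventional defaults k1=1.5 and b=0.75; the parameters, tokenisation and sample documents are assumptions for illustration, and the article does not say how Dropbox computes or weights the signal.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in docs (doc_id -> text) against the query with BM25."""
    if not docs:
        return {}
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Fall back to 1.0 so that a corpus of empty documents does not divide by zero.
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc_tokens in tokenized.values():
        doc_freq.update(set(doc_tokens))

    scores = {}
    for doc_id, doc_tokens in tokenized.items():
        term_freq = Counter(doc_tokens)
        score = 0.0
        for term in query.lower().split():
            f = term_freq[term]
            if f == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc_tokens) / avg_len
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores[doc_id] = score
    return scores


# Hypothetical documents for illustration.
docs = {
    "a": "sales report for the quarter",
    "b": "meeting notes",
    "c": "sales sales forecast",
}
scores = bm25_scores("sales", docs)
```

In a learned ranker of the kind described, a score like this would be one input feature among several rather than the final ranking.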
