
United States v. Google
United States District Court for the District of Columbia

Findings of Fact, Section II. General Search Engines


B. How a GSE Works (Greatly Simplified)

27. “A general search engine is a tool that you use to search the worldwide web using queries.” Id. at 2167:3-4 (Giannandrea). A GSE attempts to answer all queries by “provid[ing] search results that are relevant to those queries.” Id. at 8093:10-12 (Raghavan); id. at 182:6-8 (Varian). “The primary source of information for Search is the web.” UPX194 at 552.

28. The first step in developing a search engine is to crawl the web. Id. at 552; Tr. at 1774:20-22 (Lehman); id. at 2206:14-15 (Giannandrea) (“[S]tep one [of] building a general search engine would be to take a copy of as much of the web as you can.”). GSEs crawl the web using a “crawling bot,” which “starts with a list of websites[.]” Tr. at 2206:17-20 (Giannandrea). The bot “crawls the HTML on those websites and then it looks at the links inside of those web pages and then recursively crawls them.” Id. And, because websites “are constantly changing and the web is constantly growing,” GSEs “constantly recrawl the web to index new content.” UPX194 at 552–53.
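The recursive crawl described in the testimony can be sketched roughly as follows. This is a minimal illustration only, not any party's actual crawler: the `crawl` function, the `LinkExtractor` class, and the three-page `TOY_WEB` are hypothetical names invented here, and the "web" is an in-memory dictionary rather than live sites.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Start from a list of sites, follow their links, and return {url: html}."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved; skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # avoid crawling the same page twice
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page web held in memory (assumption for illustration).
TOY_WEB = {
    "a.example": '<a href="b.example">B</a>',
    "b.example": '<a href="a.example">A</a> <a href="c.example">C</a>',
    "c.example": '<a href="missing.example">dead link</a>',
}

pages = crawl(["a.example"], TOY_WEB.get)
print(sorted(pages))
```

Recrawling, as the exhibit describes, would amount to running such a loop again over time so that changed and newly created pages are picked up.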

29. The results of the web crawling are organized into an index. An index is “a database essentially of the whole web that’s publicly available that can be returned if [a] user asks for it.” Tr. at 2656:17-18 (Parakhin). The development of an index is “a crucial piece of the puzzle,” because if a site is not in the index, it will not be presented to users in response to a query. Id. at 6303:20-25 (Nayak); id. at 2210:21 (Giannandrea) (“What you include in the index matters a lot[.]”). Thus, the more sites in an index, the better. Id. at 2212:4 (Giannandrea). Today, only Google and Bing create fulsome web search indexes that generate accessible results. DDG indexes portions of the web to create its own search “modules.” Id. at 1939:2–1941:16 (Weinberg). And Apple maintains an index of about billion websites, although it does not presently plan to use that index to offer a results page. Id. at 2212:9-14 (Giannandrea); FOF ¶ 302.
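One common way to organize crawled pages for lookup is an inverted index, which maps each term to the pages containing it. The sketch below assumes that technique for illustration; the record does not state that any particular GSE structures its index this way, and `tokenize`, `build_index`, `lookup`, and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical crawled pages (assumption for illustration).
documents = {
    "a.example": "Fresh pizza recipes",
    "b.example": "Pizza delivery near you",
}
index = build_index(documents)

print(lookup(index, "pizza"))          # both pages contain the term
print(lookup(index, "pizza recipes"))  # only a.example contains both terms
print(lookup(index, "sushi"))          # no indexed page: nothing can be returned
```

The last lookup reflects the point made in the testimony: a page that never entered the index cannot appear in response to a query.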

30. An index is only useful if the GSE understands what the user is seeking with a query. GSEs “aim to identify spelling errors, annotate the query with synonyms, mark multi-word concepts, generate terms related to the query, and more.” UPX213 at 715. Google does this in many ways: through its spelling and synonyms functions, using “query-based salient terms” (QBST) that are likely to show up in a responsive document, and semantic tools, such as query clustering and segmentation. Id. at 715–16; see also UPX870 at .016–.017.
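A toy version of the query-understanding steps named in the exhibit (spelling correction, synonym annotation, and marking multi-word concepts) might look like the following. It is a sketch under stated assumptions and does not depict Google's spelling, synonym, or QBST systems: the vocabulary, synonym table, and concept list are invented, and spelling correction is approximated with the standard library's `difflib.get_close_matches`.

```python
import difflib

# Hypothetical data (assumptions for illustration only).
VOCABULARY = ["cheap", "flights", "hotel", "new", "york"]
SYNONYMS = {"cheap": ["inexpensive", "budget"], "hotel": ["lodging"]}
CONCEPTS = {("new", "york")}


def understand(query):
    """Correct spelling, mark two-word concepts, and attach synonyms."""
    corrected = []
    for word in query.lower().split():
        # Replace the word with its closest vocabulary entry, if one is close.
        match = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.75)
        corrected.append(match[0] if match else word)
    concepts = [
        " ".join(pair)
        for pair in zip(corrected, corrected[1:])
        if pair in CONCEPTS
    ]
    synonyms = {word: SYNONYMS[word] for word in corrected if word in SYNONYMS}
    return {"terms": corrected, "concepts": concepts, "synonyms": synonyms}


annotated = understand("cheep flihgts new york")
print(annotated)
```

Here the misspelled words are mapped back to vocabulary terms, "new york" is marked as a single concept, and "cheap" is annotated with alternative wordings that a responsive page might use.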

31. The GSE then must retrieve and rank websites responsive to the query. Common queries can yield a nearly infinite number of potentially responsive sites, so the GSE must include a retrieval system that narrows the volume of responsive links to tens of thousands, as opposed to millions. Tr. at 6331:7-15 (Nayak). The GSE then must rank these several thousand results. It first must decide which results are worth scoring at a more granular level, and then score those hundreds of sites to determine which top 10 or so should be surfaced to the user. Id. at 6331:13–6332:11 (Nayak); infra Section II.G.
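The staged narrowing described above, in which a large candidate set is cut to a shortlist and the shortlist is then scored in detail to pick roughly ten results, can be illustrated with a small sketch. The function `rank`, the two scoring functions, the stage sizes, and the candidate data are all hypothetical; the record does not disclose the actual signals or thresholds in this form.

```python
import heapq


def rank(candidates, coarse_score, fine_score, shortlist_size=200, page_size=10):
    """Two-stage ranking: a cheap pass picks a shortlist, a costly pass picks the page."""
    shortlist = heapq.nlargest(shortlist_size, candidates, key=coarse_score)
    return heapq.nlargest(page_size, shortlist, key=fine_score)


# Hypothetical retrieved candidates with made-up signals (not real data).
candidates = [
    {"url": f"site{i}.example", "term_hits": i % 50, "quality": (i * 7) % 101}
    for i in range(5000)
]


def coarse_score(doc):
    """Cheap signal applied to every retrieved candidate."""
    return doc["term_hits"]


def fine_score(doc):
    """More granular signal applied only to the shortlist."""
    return doc["term_hits"] + doc["quality"] / 101


top = rank(candidates, coarse_score, fine_score)
for doc in top:
    print(doc["url"], round(fine_score(doc), 3))
```

The design point is cost: the granular score is computed for a few hundred shortlisted pages rather than for every retrieved candidate.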

32. The above-described culling and sorting process by which a GSE produces search results is illustrated below:

[Figure omitted: diagram of the culling and sorting process.] DXD17 at 2.