Web archive lets you easily search millions of government documents

Imagine trying to find a single redacted document buried somewhere in 70 million pages of government records: a needle in a seemingly impossible haystack. That's the problem a team at the University of Washington set out to solve, and the tool they built could change how journalists, historians, and the general public access public information.

Their creation, called GovScape, is a search system designed specifically for the End of Term Web Archive — a massive digital collection that preserves every administration's web presence from George W. Bush's second term in 2008 through 2024. The archive contains everything from policy documents to aerial photographs, pie charts, and redacted pages. But its sheer size has made it nearly impossible to navigate without knowing exactly what you're looking for.

"The End of Term Web Archive is immensely important to historians, journalists and the American public," said Benjamin Charles Germain Lee, a UW assistant professor in the Information School who led the research. "But many of these digital archives are getting so big that finding information is the real challenge."

GovScape tackles that challenge with three different ways to search. Users can look up exact keywords — say, "FAFSA" — or use semantic search, which finds documents about a topic even if those exact words don't appear on the page. A visual search option lets people query for qualities like "redacted documents" or "pie charts." The system currently covers the 10 million PDFs hosted online during Donald Trump's first term, with plans to expand to the full 70-million-document archive.

What makes GovScape remarkable isn't just what it can do but how efficiently it does it. The team processed those 10 million PDFs for less than $1,500, or roughly $1 per 47,000 pages. By comparison, commercial AI services might charge that same $1 to parse only about 100 pages. The researchers achieved this by using highly efficient AI models that convert each page's text and images into mathematical "embeddings" (essentially fingerprints that capture what each page contains) and then grouping similar pages together.
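The cost figures quoted here can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration, not the team's accounting: it treats $1,500 as an upper bound, and it assumes the commercial comparison means about $1 per 100 pages. The variable names are made up for this example.

```python
# Figures quoted in the article.
TOTAL_COST_USD = 1_500        # "less than $1,500", used here as an upper bound
PAGES_PER_DOLLAR = 47_000     # "roughly $1 per 47,000 pages"

# Assumption: the commercial comparison is read as about $1 per 100 pages.
COMMERCIAL_PAGES_PER_DOLLAR = 100

# Page count implied by the two quoted figures.
implied_pages = TOTAL_COST_USD * PAGES_PER_DOLLAR

# How many times more pages per dollar than the commercial rate.
cost_ratio = PAGES_PER_DOLLAR / COMMERCIAL_PAGES_PER_DOLLAR

# What the same page count would cost at the assumed commercial rate.
commercial_cost_usd = implied_pages / COMMERCIAL_PAGES_PER_DOLLAR

print(f"Implied pages processed: {implied_pages:,}")
print(f"Pages per dollar vs. commercial rate: {cost_ratio:.0f}x")
print(f"Cost at the assumed commercial rate: ${commercial_cost_usd:,.0f}")
```

Under those assumptions, the quoted rate works out to about 70 million pages, which matches the page count mentioned at the top of the article, and to roughly 470 times more pages per dollar than the commercial rate.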

"Just as library classification systems group books on similar topics on the same shelf, these embeddings group similar pages with one another based on their visual and textual content," Lee explained.
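To make the shelf analogy concrete, here is a minimal sketch of searching by embedding similarity. It is not GovScape's implementation: the three-number vectors are invented by hand for illustration (real embeddings come from trained models and have hundreds of dimensions), and the page names and the `search` function are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def search(query_vec, page_vecs, top_k=2):
    """Return the ids of the top_k pages whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), page_id)
              for page_id, vec in page_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page_id for _, page_id in scored[:top_k]]


# Hand-made toy embeddings (assumption: purely illustrative values).
# Loosely, the axes stand for: student aid, charts, redactions.
PAGE_EMBEDDINGS = {
    "student_aid_form": [0.9, 0.1, 0.0],
    "loan_guidance": [0.8, 0.2, 0.1],
    "budget_pie_chart": [0.1, 0.9, 0.2],
    "redacted_memo": [0.0, 0.2, 0.9],
}

# A query about student aid lands near both aid-related pages,
# even though no keywords are compared at any point.
aid_results = search([0.85, 0.15, 0.05], PAGE_EMBEDDINGS)

# A query describing a visual quality (redaction) lands near the redacted memo.
redaction_results = search([0.0, 0.1, 1.0], PAGE_EMBEDDINGS, top_k=1)

print(aid_results)
print(redaction_results)
```

The point of the design is that pages are compared by direction in the embedding space rather than by shared words, which is why a semantic or visual query can surface pages that never contain the search terms.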

The team presented the research July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego, and the work is published on the arXiv preprint server.

Looking ahead, Lee hopes to scale GovScape across the entire archive and eventually expand beyond PDFs to include spreadsheets, images, and HTML pages, capturing the full breadth of government information online. For Lee, the stakes extend well beyond the archive itself.

"I'm really excited about the prospects for better access to government information with projects like GovScape," he said. "Being able to actually find relevant information is vital to the health of democracy and to the functioning of society."
