Banner Depot 2000

About

Banner Depot 2000 is an interactive archive containing 22,915 web ad banners that existed on Chinese- and English-language web pages in the late 1990s and early 2000s. The banners are extracted from archived web page snapshots of 77,747 URLs featured in six printed Internet directory books published between 1999 and 2001 in the United States and China.

To create this archive, we retrieved available snapshots of these URLs from the Internet Archive's Wayback Machine. We then extracted images adhering to common ad banner dimensions and extracted text data from each image using optical character recognition.
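As a rough illustration of the dimension-based filtering step, the sketch below checks whether an image's pixel size matches a list of common banner sizes. The size list, the function name `looks_like_banner`, and the `tolerance` parameter are assumptions made for this example. The exact dimensions used to build the archive are not stated here.

```python
# Hypothetical list of common banner sizes (width, height) in pixels,
# based on widely used ad unit sizes of the period. This is an assumption,
# not the list actually used to build the archive.
COMMON_BANNER_SIZES = {
    (468, 60),   # full banner
    (234, 60),   # half banner
    (120, 600),  # skyscraper
    (120, 240),  # vertical banner
    (125, 125),  # square button
    (120, 90),   # button
    (120, 60),   # small button
    (88, 31),    # micro bar
}


def looks_like_banner(width: int, height: int, tolerance: int = 0) -> bool:
    """Return True if (width, height) is within `tolerance` pixels of a listed size."""
    return any(
        abs(width - w) <= tolerance and abs(height - h) <= tolerance
        for w, h in COMMON_BANNER_SIZES
    )
```

With the default tolerance of 0, only exact matches pass. A small tolerance would also admit slightly off-size images, at the cost of more false positives.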

On this website, you can explore the entire banner ad collection, and search for specific banners by keywords.

You can also make banner ad poetry using individual frames from the banner ads, optionally with the help of a large language model.

The full banner ad dataset is available for download at https://doi.org/10.5281/zenodo.8408539.

Created by Richard Lewei Huang and Yufeng Zhao.

a switcheristic telecomunications project

An Explainer about Banner Ad Temporal Coherence

On Banner Depot 2000, we use the following terms to highlight the temporal relationship between the banner ad images and the archived web pages they appear on:

Time Skew

Time skew is the difference between when a banner ad image was captured and when the archived web page it appears on was captured. A large skew does not necessarily mean that the ad is temporally incoherent with the archived web page it appears on, and a small skew does not guarantee coherence (see Temporal Coherence below).
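A minimal sketch of the time skew calculation follows. It assumes both capture dates are given as 14-digit Wayback Machine timestamps (YYYYMMDDhhmmss). The function names and the sign convention (positive when the image was captured after the page) are choices made for this example only.

```python
from datetime import datetime, timedelta


def parse_wayback_timestamp(ts: str) -> datetime:
    """Parse a 14-digit Wayback Machine timestamp such as '20000301120000'."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def time_skew(image_capture: str, page_capture: str) -> timedelta:
    """Image capture time minus page capture time.

    Positive: the image was captured after the page.
    Negative: the image was captured before the page.
    """
    return parse_wayback_timestamp(image_capture) - parse_wayback_timestamp(page_capture)
```

For example, `time_skew("20000301000000", "20000101000000")` gives a skew of 60 days, because the image was captured two months after the page.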

Temporal Coherence

A banner ad image is temporally coherent with the archived web page it appears on if a user in the past visiting the web page on its capture date would have encountered the same banner ad image that now appears on the archived web page snapshot accessed through the Wayback Machine's web interface.

We use the archived ad image's Last-Modified HTTP header, the archived ad image's capture date, and the archived web page's capture date to calculate whether the image was likely to have appeared on the original web page at the web page's capture date. Our calculation method is as follows:

Prima facie coherent: Image Last-Modified date <= Web page snapshot timestamp <= Image capture date

Prima facie violative: Web page snapshot timestamp < Image Last-Modified date < Image capture date

Possibly coherent: Image Last-Modified date < Image capture date < Web page snapshot timestamp

Probably violative: No Last-Modified header (usually happens for images served from ad networks, as the networks usually serve a new ad each time the web page loads)
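The four rules above can be sketched as a single function. This is an illustration rather than the code used for the archive: the function name, the label strings, and the "unclassified" fallback for orderings that the rules do not cover (for example, a Last-Modified date equal to the image capture date and later than the page snapshot) are assumptions.

```python
from datetime import datetime
from typing import Optional


def classify_coherence(
    last_modified: Optional[datetime],
    image_capture: datetime,
    page_capture: datetime,
) -> str:
    """Apply the four temporal coherence rules to one banner ad image."""
    if last_modified is None:
        # No Last-Modified header, typical for images served by ad networks.
        return "probably violative"
    if last_modified <= page_capture <= image_capture:
        return "prima facie coherent"
    if page_capture < last_modified < image_capture:
        return "prima facie violative"
    if last_modified < image_capture < page_capture:
        return "possibly coherent"
    # Hypothetical fallback: the ordering matches none of the stated rules.
    return "unclassified"
```

For instance, an image last modified on 1 January 2000 and captured on 1 June 2000 would be prima facie coherent with a page snapshot taken on 1 March 2000, because the image already existed, unchanged, when the page was captured.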

Why is this necessary?

A web page generally consists of an HTML file plus embedded media files (including the banner ad images). When capturing a web page, the Wayback Machine usually cannot capture all of the page's embedded resources simultaneously with the page itself. Therefore, when a user requests an archived web page snapshot, the Wayback Machine delivers a "best effort" reconstruction of the page by rewriting URLs in the archived HTML file so that each embedded resource loads from the closest available archived version. This process is called recomposition in web archive scholarship. For some archived web pages, recomposition has to retrieve snapshots of embedded resources archived months or even years from the capture date of the web page itself, creating temporal inconsistencies on the resulting web page.
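To make the "closest available archived version" idea concrete, the sketch below picks, from a list of capture times for one embedded resource, the capture nearest in time to the page snapshot. It is a simplified model for illustration. The function name is hypothetical, and the Wayback Machine's actual selection logic may differ.

```python
from datetime import datetime
from typing import Optional, Sequence


def closest_capture(
    page_capture: datetime,
    resource_captures: Sequence[datetime],
) -> Optional[datetime]:
    """Return the resource capture nearest in time to the page capture.

    Returns None when the resource was never captured.
    """
    if not resource_captures:
        return None
    # The nearest capture may fall before or after the page capture.
    return min(resource_captures, key=lambda c: abs(c - page_capture))
```

If a page was captured on 1 May 2000 and its banner image was captured only in January 1999, June 2000, and January 2002, this model selects the June 2000 capture. Had the 2002 capture been the only one available, it would have been selected despite the gap of more than a year.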

For more information about web archive temporal coherence, see Ainsworth, S. G., Nelson, M. L., & Van de Sompel, H. (2015, August). Only one out of five archived web pages existed as presented. In Proceedings of the 26th ACM Conference on Hypertext & Social Media (pp. 257-266).