The Hoberg and Phillips firm-scope database, along with its industry mappings, is based on the text firms use to describe their businesses and product offerings in annual public-firm 10-K disclosures. Scope is identified at the individual firm level using doc2vec and k-means spatial models that reduce the dimensionality of product descriptions and group their content into buckets based on common product discussions. The keywords that most define each industry are then used to map firms to the industries they serve, and scope is computed as the count of industries a given firm serves in a given year.

The scope database is best described through its two parts, each available as a separate download. The first is a firm-year panel database containing information about each firm’s scope. The second is a higher-dimensional mapping database that indicates the specific doc2vec industries each firm likely operates in within each year. This second part is a firm-industry-year panel database with textual scores indicating how strong the link is between each firm and its mapped industries.
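As a rough illustration of how the two parts relate, the sketch below collapses a small firm-industry-year panel into firm-year scope by counting the distinct industries mapped to each firm in each year. The row layout, identifiers, and scores are invented for this example and are not the actual file format. The published scope data may also apply steps that a raw count does not capture, so consult the readme files before relying on this.

```python
from collections import defaultdict

# Hypothetical rows from a firm-industry-year panel:
# (firm_id, year, industry_id, score). All values are invented for illustration.
rows = [
    (1001, 2019, 12, 0.41),
    (1001, 2019, 57, 0.22),
    (1001, 2020, 12, 0.39),
    (2002, 2019, 12, 0.18),
    (2002, 2019, 33, 0.35),
    (2002, 2019, 57, 0.27),
]


def scope_by_firm_year(panel):
    """Count the distinct mapped industries for each (firm_id, year) pair."""
    industries = defaultdict(set)
    for firm_id, year, industry_id, _score in panel:
        industries[(firm_id, year)].add(industry_id)
    return {key: len(ids) for key, ids in industries.items()}


scope = scope_by_firm_year(rows)
for (firm_id, year), n_industries in sorted(scope.items()):
    print(firm_id, year, n_industries)
```

In this toy panel, firm 1001 maps to two industries in 2019 and one in 2020, while firm 2002 maps to three industries in 2019.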

Welcome to the Hoberg-Phillips Firm-Scope Data Library

<< NEW: Data coverage now 1989 to 2021! >>

Data provided

by Gerard Hoberg (University of Southern California),

and Gordon Phillips (Dartmouth College)


* Please cite the following paper when using this data.  Details regarding the creation and use of this data are documented in the paper.

Scope, Scale and Concentration, Journal of Finance (forthcoming). [Download Paper]


Firm Scope Data (firm-scope for each firm in each year):

** This is the primary database that addresses the overwhelming majority of research needs.

Readme file

Text-Based Firm Scope Data

* The extent of firm scope is identified at the individual firm level using doc2vec and k-means spatial modeling.

** Please review all details in the readme file before using the database.

Firm-Industry Assignments in each Year (with industry labels):

Readme file

Firm-Industry Assignments Data

* The mappings from each firm to the likely industries it serves (along with industry labels) are identified at the individual firm level using doc2vec and k-means spatial modeling.

** Please review all details in the readme file before using the database.

