Topic models are a machine learning technique that we adopted at Data Assessment Solutions very early on. Because these models remain relevant, we want to briefly present how they work and how we use them.
The most famous topic model is Latent Dirichlet Allocation (LDA), introduced by David Blei, Andrew Ng and Michael Jordan in 2003. The article that presented the model has now been cited more than 36,000 times, which is a very large number for a computer science article.
The idea behind topic models is easily explained using the example of newspaper articles. Newspaper articles are usually assigned to topics such as politics, business, science, culture or sport. However, there are also collections of text documents for which no such assignment is known. Topic models try to learn topics automatically from a collection of documents. One difficulty is that some documents cannot be assigned to a single topic. In the case of newspaper articles, this could be an article on economic policy, for example. The assignment of an article to a topic depends on the frequency distribution of the words in the article. A business article contains many words from the field of economics, while a political article contains many words from the field of politics. An economic policy article contains many words from both domains.
A topic model represents each topic simply as a probability distribution over the words of a dictionary. For example, if the words force, momentum, mass, and gravity have large probabilities in a topic, it might be called a physics topic. It is interesting in itself to see which topics a model produces, but the model can also be queried. In the case of text documents, a query can itself be a text document. The result is then a set of mixing coefficients that specify the extent to which the document belongs to each of the topics.
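To make the idea of querying more concrete, here is a minimal sketch in Python. It assumes a tiny six-word dictionary and two hand-written topics, both invented for illustration and not learned from data, and it estimates the mixing coefficients of a query document with a simple EM loop while keeping the topics fixed. A full LDA implementation would also learn the topics and place Dirichlet priors on the mixtures, so this should be read as an illustration of the principle rather than as the model itself.

```python
# Hypothetical toy dictionary and two hand-written topics (not learned from data).
# Each topic is a probability distribution over the dictionary and sums to 1.
VOCAB = ["tax", "market", "inflation", "election", "parliament", "minister"]
TOPICS = {
    "business": [0.30, 0.35, 0.25, 0.03, 0.03, 0.04],
    "politics": [0.06, 0.03, 0.03, 0.30, 0.28, 0.30],
}


def bag_of_words(text):
    """Count how often each dictionary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in VOCAB]


def mixing_coefficients(counts, topics, iterations=100):
    """Estimate the topic mixture of one document by EM, with the topics held fixed."""
    names = list(topics)
    theta = {name: 1.0 / len(names) for name in names}  # start from a uniform mixture
    total = sum(counts)
    if total == 0:
        return theta  # no dictionary words in the query: nothing to update
    for _ in range(iterations):
        expected = {name: 0.0 for name in names}
        for w, n in enumerate(counts):
            if n == 0:
                continue
            # E-step: how responsible is each topic for word w?
            weights = {name: theta[name] * topics[name][w] for name in names}
            norm = sum(weights.values())
            for name in names:
                expected[name] += n * weights[name] / norm
        # M-step: turn the expected word counts per topic into a mixture
        theta = {name: expected[name] / total for name in names}
    return theta


if __name__ == "__main__":
    query = "minister election tax market inflation parliament"
    print(mixing_coefficients(bag_of_words(query), TOPICS))
```

A query that uses words from both domains, like the one in the usage example, receives substantial weight on both topics, which corresponds to the economic policy article described above.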
At Data Assessment Solutions, we use topic models to examine skill profiles rather than text documents. It helps that skill profiles can be represented in almost the same way as text documents. A simple representation of a text document is a list that records, for each word in a dictionary, how often the word appears in the document. For skill profiles, the dictionary is replaced by a skills catalog. An employee's skill profile indicates the level, for instance on a scale from 1 to 5, at which he or she possesses each skill in the catalog. From the skill profiles of a company's employees, we can thus learn topics. In this context, we refer to topics as role profiles. Typical role profiles in an IT company are, for example, system administrators, web developers or database specialists. However, almost every company also has interesting role profiles that are specific to it. The learned role profiles provide an aggregated, systematic overview of the experience and skills available in the company. Catalogs with hundreds to thousands of skills are the rule among our customers. A management overview is therefore impossible at the non-aggregated level. With a few dozen topics, on the other hand, you get a good idea of what things look like. Furthermore, topic models are used in the search for employees and skills and in the targeted development of skills in the company.
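The following Python sketch illustrates how skill profiles can be treated like text documents and how role profiles could be learned from them. Everything in it is an assumption made for illustration: the six-skill catalog, the five employee profiles, the choice of two role profiles and the hyperparameters alpha and beta are invented, and the skill level is simply used as if it were a word count. The topics are inferred with collapsed Gibbs sampling, a common inference method for LDA, rather than with the variational inference of the original paper, and the sketch is not meant to describe the implementation used at Data Assessment Solutions.

```python
import random

# Hypothetical skills catalog and employee profiles (skill -> level on a 1-5 scale).
CATALOG = ["linux", "networking", "backup", "html", "css", "javascript"]
PROFILES = {
    "alice": {"linux": 5, "networking": 4, "backup": 3},
    "bob": {"linux": 4, "networking": 5, "backup": 4, "html": 1},
    "carol": {"html": 5, "css": 4, "javascript": 5},
    "dave": {"html": 4, "css": 5, "javascript": 4, "linux": 1},
    "erin": {"linux": 3, "networking": 2, "html": 3, "javascript": 3},
}


def to_tokens(profile):
    """Treat a skill level like a word count: level 4 yields four occurrences."""
    return [CATALOG.index(skill) for skill, level in profile.items() for _ in range(level)]


def learn_role_profiles(docs, n_roles, n_skills, sweeps=200, alpha=0.1, beta=0.1, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns (phi, theta): phi[k] is role profile k as a distribution over skills,
    theta[d] is the mixture of role profiles for employee d.
    """
    rng = random.Random(seed)
    doc_role = [[0] * n_roles for _ in docs]              # tokens per employee and role
    role_skill = [[0] * n_skills for _ in range(n_roles)]  # tokens per role and skill
    role_total = [0] * n_roles                             # tokens per role

    def update(d, w, k, delta):
        doc_role[d][k] += delta
        role_skill[k][w] += delta
        role_total[k] += delta

    # Random initial assignment of every token to a role.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_roles) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, w, k, +1)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)  # remove the token from the counts
                # Conditional probability of each role for this token (unnormalised).
                weights = [
                    (doc_role[d][k] + alpha)
                    * (role_skill[k][w] + beta) / (role_total[k] + n_skills * beta)
                    for k in range(n_roles)
                ]
                assign[d][i] = rng.choices(range(n_roles), weights=weights)[0]
                update(d, w, assign[d][i], +1)  # add it back under the new role

    phi = [[(role_skill[k][w] + beta) / (role_total[k] + n_skills * beta)
            for w in range(n_skills)] for k in range(n_roles)]
    theta = [[(doc_role[d][k] + alpha) / (len(doc) + n_roles * alpha)
              for k in range(n_roles)] for d, doc in enumerate(docs)]
    return phi, theta


docs = [to_tokens(profile) for profile in PROFILES.values()]
phi, theta = learn_role_profiles(docs, n_roles=2, n_skills=len(CATALOG))

if __name__ == "__main__":
    for k, dist in enumerate(phi):
        top = sorted(zip(CATALOG, dist), key=lambda pair: -pair[1])[:3]
        print("role profile", k, [(skill, round(p, 2)) for skill, p in top])
    for name, mix in zip(PROFILES, theta):
        print(name, [round(p, 2) for p in mix])
```

Each row of phi can be read as one role profile, that is, a distribution over the skills catalog, while each row of theta shows how strongly an employee belongs to each role profile. With real catalogs of hundreds to thousands of skills, a maintained library implementation would be a more sensible choice than this hand-written sampler.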