Towards better information highlighting on technical Q&A platform
Abstract
Navigating the knowledge on Stack Overflow (SO) remains challenging. To make posts more vivid, SO allows users to write and edit posts with Markdown or HTML so that they can use various formatting styles (e.g., bold, italic, and code) to highlight important information. Previous studies show the benefits of information highlighting in various domains (e.g., improving human reading time). However, little is known about how information is highlighted on technical Q&A sites (e.g., Stack Overflow). In this study, we carry out the first large-scale exploratory study of information highlighting in SO answers. We observe that information highlighting is prevalent on SO, i.e., 47.6% of the answers have information highlighted. More specifically, 38.5%, 11.3%, and 7.2% of the answers use Code, Bold, and Italic, respectively. Besides source code-related content (e.g., identifiers and programming keywords), users also frequently highlight updates (e.g., updates of answers), caveats (i.e., a reminder or warning about the context or condition in which the provided solution works or does not work), and references. Users highlight content with Code more often than with the other formatting types. To ease the highlighting process, we develop approaches to recommend highlighted content automatically, using neural network architectures originally designed for the Named Entity Recognition task. Models are trained for each type of formatting (i.e., Bold, Italic, Code, and Heading) using the information highlighting dataset we collected from SO answers. The models with a CNN architecture achieve precision values ranging from 0.71 to 0.82. Recall is generally much lower than precision, but the model for automatic code content highlighting achieves a recall of 0.73 and an F1 score of 0.75, outperforming the others. We also compare these models with BERT models trained on our datasets.
Our analysis of failure cases indicates that most of them are missed identifications (i.e., the model misses content that should be highlighted), because the models tend to learn frequently highlighted words while struggling with less frequent ones.
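To illustrate how highlighting can be framed as a Named Entity Recognition-style task, the sketch below turns the rendered HTML of an answer into token-level BIO labels, one label type per formatting style. It is a minimal illustration and not the pipeline used in the study: the tag-to-label mapping `TAG_TO_LABEL`, the whitespace tokenization, the BIO scheme, and the names `HighlightTagger` and `bio_labels` are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumed mapping from rendered HTML tags to highlight types (hypothetical).
TAG_TO_LABEL = {
    "code": "CODE",
    "strong": "BOLD", "b": "BOLD",
    "em": "ITALIC", "i": "ITALIC",
    "h1": "HEADING", "h2": "HEADING", "h3": "HEADING",
}


class HighlightTagger(HTMLParser):
    """Collects (token, BIO label) pairs from an HTML answer body."""

    def __init__(self):
        super().__init__()
        self.stack = []            # highlight labels of currently open tags
        self.pairs = []            # resulting (token, label) pairs
        self.span_started = False  # True once the current span has a B- token

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_LABEL:
            self.stack.append(TAG_TO_LABEL[tag])
            self.span_started = False

    def handle_endtag(self, tag):
        label = TAG_TO_LABEL.get(tag)
        if label and self.stack and self.stack[-1] == label:
            self.stack.pop()
            # Simplification: text after a nested span starts a new B- span.
            self.span_started = False

    def handle_data(self, data):
        # Whitespace tokenization is an assumption made for brevity.
        for token in data.split():
            if not self.stack:
                self.pairs.append((token, "O"))
            else:
                prefix = "I" if self.span_started else "B"
                self.pairs.append((token, f"{prefix}-{self.stack[-1]}"))
                self.span_started = True


def bio_labels(html):
    """Return a list of (token, label) pairs for an HTML fragment."""
    parser = HighlightTagger()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    answer = "<p>Use <code>str.join</code> here. <strong>Note: slow</strong> for lists.</p>"
    for token, label in bio_labels(answer):
        print(f"{token}\t{label}")
```

A sequence labeler trained on such pairs predicts, for each token, whether it should be highlighted and with which style. Under this framing, a missed identification corresponds to a token that carries a `B-` or `I-` label in the data but is predicted as `O`.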