Towards better information highlighting on technical Q&A platform

Ahmed, Shahla

dc.contributor.author	Ahmed, Shahla
dc.contributor.examiningcommittee	Leung, Carson (Computer Science)
dc.contributor.examiningcommittee	Turgeon, Max (Statistics)
dc.contributor.supervisor	Wang, Shaowei
dc.date.accessioned	2023-09-05T21:38:12Z
dc.date.available	2023-09-05T21:38:12Z
dc.date.issued	2023-08-24
dc.date.submitted	2023-08-24T15:41:23Z	en_US
dc.date.submitted	2023-09-05T21:29:05Z	en_US
dc.degree.discipline	Computer Science	en_US
dc.degree.level	Master of Science (M.Sc.)
dc.description.abstract	Navigating the knowledge on Stack Overflow (SO) remains challenging. To make posts vivid to users, SO allows users to write and edit posts with Markdown or HTML so that they can use various formatting styles (e.g., bold, italic, and code) to highlight important information. Previous studies show the benefits of information highlighting in various domains (e.g., improving human reading time). However, little is known about how information is highlighted on technical Q&A sites (e.g., Stack Overflow). In this study, we carry out the first large-scale exploratory study of information highlighting in SO answers. We observe that information highlighting is prevalent on SO, i.e., 47.6% of the answers have information highlighted. More specifically, 38.5%, 11.3%, and 7.2% of the answers use Code, Bold, and Italic, respectively. Besides source code-related content (e.g., identifiers and programming keywords), users also frequently highlight updates (e.g., updates of answers), caveats (i.e., a reminder or warning about the context or condition in which the provided solution works or does not work), and references. Users tend to highlight code words more than they use other tags. To ease the highlighting process, we develop approaches to recommend highlighted content automatically by using neural network architectures initially designed for the Named Entity Recognition task. Models are trained for each type of formatting (i.e., Bold, Italic, Code, and Heading) using the information highlighting dataset we collected from SO answers. The models with CNN architecture achieve precision values ranging from 0.71 to 0.82. While the recall values are much lower than the precision values, the model for automatic code content highlighting achieves a recall of 0.73 and an F1 score of 0.75, outperforming the others. The results of these models were later compared with BERT models trained on our datasets.
The analysis of failure cases indicates that the majority of failures are missed identifications (i.e., the model misses content that is supposed to be highlighted), because the models tend to learn the more frequently highlighted words while struggling to learn less frequent ones.
dc.description.note	October 2023
dc.identifier.uri	http://hdl.handle.net/1993/37572
dc.language.iso	eng
dc.rights	open access	en_US
dc.subject	Stack Overflow
dc.subject	Information highlighting
dc.subject	Named entity recognition
dc.subject	Deep learning
dc.title	Towards better information highlighting on technical Q&A platform
dc.type	master thesis	en_US
local.subject.manitoba	yes

Files

Original bundle

Name: Shahla_Ahmed_Thesis.pdf
Size: 784.09 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 770 B
Format: Item-specific license agreed to upon submission

Collections

FGS - Electronic Theses and Practica
Manitoba Heritage Theses