Towards better information highlighting on technical Q&A platform

dc.contributor.authorAhmed, Shahla
dc.contributor.examiningcommitteeLeung, Carson (Computer Science)
dc.contributor.examiningcommitteeTurgeon, Max (Statistics)
dc.contributor.supervisorWang, Shaowei
dc.date.accessioned2023-09-05T21:38:12Z
dc.date.available2023-09-05T21:38:12Z
dc.date.issued2023-08-24
dc.date.submitted2023-08-24T15:41:23Zen_US
dc.date.submitted2023-09-05T21:29:05Zen_US
dc.degree.disciplineComputer Scienceen_US
dc.degree.levelMaster of Science (M.Sc.)
dc.description.abstractNavigating the knowledge on Stack Overflow (SO) remains challenging. To make posts vivid to readers, SO allows users to write and edit posts with Markdown or HTML, so that they can use various formatting styles (e.g., bold, italic, and code) to highlight important information. Previous studies show the benefits of information highlighting in various domains (e.g., improving human reading time). However, little is known about how information is highlighted on technical Q&A sites such as SO. In this study, we carry out the first large-scale exploratory study of information highlighting in SO answers. We observe that information highlighting is prevalent on SO: 47.6% of the answers have highlighted information. More specifically, 38.5%, 11.3%, and 7.2% of the answers use Code, Bold, and Italic, respectively. Besides source code-related content (e.g., identifiers and programming keywords), users also frequently highlight updates (e.g., updates of answers), caveats (i.e., a reminder or warning about the context or condition in which the provided solution works or does not work), and references. Users tend to highlight code words more than other tags. To ease the highlighting process, we develop approaches that automatically recommend content to highlight, using neural network architectures originally designed for the Named Entity Recognition task. Models are trained for each type of formatting (i.e., Bold, Italic, Code, and Heading) using the information highlighting dataset we collected from SO answers. The models with a CNN architecture achieve precision values ranging from 0.71 to 0.82. While the recall values are much lower than the precision values, the model for automatic code content highlighting achieves a recall of 0.73 and an F1 score of 0.75, outperforming the others. The results of these models were later compared with BERT models trained on our datasets.
The analysis of failure cases indicates that the majority are missed identifications (i.e., the model misses content that is supposed to be highlighted), because the models tend to learn the more frequently highlighted words while struggling to learn less frequent ones.
dc.description.noteOctober 2023
dc.identifier.urihttp://hdl.handle.net/1993/37572
dc.language.isoeng
dc.rightsopen accessen_US
dc.subjectStack Overflow
dc.subjectInformation highlighting
dc.subjectNamed entity recognition
dc.subjectDeep learning
dc.titleTowards better information highlighting on technical Q&A platform
dc.typemaster thesisen_US
local.subject.manitobayes
Files
Original bundle
Name: Shahla_Ahmed_Thesis.pdf
Size: 784.09 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 770 B
Format: Item-specific license agreed to upon submission