DM882: Text Mining

Study Board of Science

Teaching language: Danish or English depending on the teacher, but English if international students are enrolled
EKA: N340090102
Assessment: Second examiner: External
Grading: 7-point grading scale
Offered in: Odense
Offered in: Spring
Level: Master

STADS ID (UVA): N340090101
ECTS value: 5

Date of Approval: 01-11-2022


Duration: 1 semester

Version: Approved - active

Comment

The course was not offered in Spring 2023. According to IMADA, it will be offered again in Spring 2024.

Entry requirements

None

Academic preconditions

Students taking the course are expected to:

  • Have basic knowledge of probability theory, e.g. from having followed DM566 (Data Mining and Machine Learning)
  • Have basic knowledge of algorithmics, e.g. from having followed DM507 (Algorithms and data structures)
  • Be proficient in programming, preferably in Python, e.g. from having followed DM561 (Linear Algebra)

Course introduction

The aim of the course is to provide an introduction to Text Mining of unstructured text in natural languages. The increasing amount of digitized text calls for the development of formal frameworks that process such data, extract information from it, and draw statistical conclusions based on its content. The course is designed to provide a sound theoretical basis for processing unstructured text, along with example applications. We will start with simple examples of unstructured text that demonstrate the abilities of current Text Mining methods and highlight their advantages and shortcomings. We will then apply these methods to more realistic datasets sourced from online news media and scientific publications. The course content is designed to show how computer science and data science methods are applied to real-world data.

In relation to the competence profile of the degree it is the explicit focus of the course to:

  • Give knowledge of some of the main sources and representations of unstructured text.
  • Give the competence to normalize unstructured text into suitable corpora for computational applications.
  • Give an understanding of methods such as Named Entity Recognition, Topic Detection and Sentiment Analysis.
  • Give examples of applications of Text Mining methods, providing the ability to choose the right set of tools for a task.
  • Provide a basis for planning and carrying out Text Mining tasks, starting from raw unstructured text and ending with a set of conclusions.
  • Give an understanding of how theoretical computer science methods are applied to real-world data.

Expected learning outcome

The learning objective of the course is that the student demonstrates the ability to:
  • Understand some of the main types of unstructured text.
  • Manipulate unstructured text.
  • Transform unstructured text into a suitable normalized representation.
  • Train and execute Named Entity Recognition Models.
  • Train and execute Topic Detection Models.
  • Train and execute Sentiment Analysis Models.
  • Understand Machine Translation Models.
  • Understand how the limitations of Text Mining methods depend on the content, such as non-English text (e.g. Danish or Mandarin).
  • Perform statistical analysis on unstructured text.
  • Understand the limits and drawbacks of Text Mining methods.
  • Form hypotheses regarding unstructured text and pick the tools to check the hypotheses.
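
As an illustration of what transforming unstructured text into a normalized representation can look like, here is a minimal sketch using only the Python standard library. It is not taken from the course material. The function names `normalize` and `build_corpus` are hypothetical, and the choices made (lowercasing, Unicode NFKC normalization, word tokens, bag-of-words counts) are assumptions picked for brevity.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Return a list of lowercase word tokens (hypothetical helper).

    Assumed pipeline: Unicode NFKC normalization, lowercasing, then
    splitting on anything that is not a word character.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    # \w+ matches runs of letters, digits and underscores,
    # including non-ASCII letters such as the Danish "ø".
    return re.findall(r"\w+", text)


def build_corpus(documents):
    """Map each raw document to a bag-of-words Counter (hypothetical helper)."""
    return [Counter(normalize(doc)) for doc in documents]


docs = ["Text mining is fun!", "Mining TEXT, mining data."]
corpus = build_corpus(docs)
print(corpus[1].most_common(1))
```

A real pipeline would typically add language-specific steps such as stop-word removal or lemmatization, which are left out of this sketch.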

Content

The following main topics are contained in the course:
  • Sources and formats of unstructured text.
  • Normalization, representation and annotation of unstructured text into corpora.
  • Named Entity Recognition Models.
  • Topic Detection Models.
  • Sentiment Analysis Models.
  • Machine Translation Models.
  • Supervised and unsupervised analysis of unstructured text.
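
To give a rough idea of what unsupervised analysis of text can involve, the sketch below compares documents by the cosine similarity of their word-count vectors, with no labels required. It is an illustrative assumption rather than the method taught in the course. The helper names `bag` and `cosine_similarity` and the example sentences are made up for this sketch.

```python
import math
from collections import Counter


def bag(text):
    """Build a bag-of-words Counter from whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up examples in the style of news headlines and a paper title.
news_a = bag("parliament votes on new budget")
news_b = bag("budget vote delayed in parliament")
paper = bag("protein structure prediction with deep learning")

print(cosine_similarity(news_a, news_b))
print(cosine_similarity(news_a, paper))
```

The two news-style texts share vocabulary and therefore score higher than the news and paper pair. Grouping documents by such scores is one simple way to explore a corpus before any labels exist.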

Literature

See itslearning for syllabus lists and additional literature references.

Examination regulations

Exam element a)

Timing

Spring

Tests

Project

EKA

N340090102

Assessment

Second examiner: External

Grading

7-point grading scale

Identification

Full name and SDU username

Language

Normally, the same as teaching language

Examination aids

To be announced during the course.

ECTS value

5

Indicative number of lessons

35 hours per semester

Teaching Method

The teaching method is based on the three-phase model.

  • Intro phase: 20 hours
  • Skills training phase: 15 hours, of which tutorials: 15 hours

Activities during the study phase: Solving small assignments, individually or in small groups.

Teacher responsible

Name: Konrad Krawczyk
E-mail: konradk@imada.sdu.dk
Department: Institut for Biokemi og Molekylær Biologi

Timetable

Administrative Unit

Institut for Matematik og Datalogi (datalogi)

Team at Educational Law & Registration

NAT

Offered in

Odense

Recommended course of study

Profile Education Semester Offer period

Transition rules

Transitional arrangements describe how a course replaces another course when changes are made to the course of study. 
If a transitional arrangement has been made for a course, it will be stated in the list. 
See transitional arrangements for all courses at the Faculty of Science.