Chapter 5 - Data and Annotation: Step-by-Step Examples
Chapter Outline
- Quick Recap
- Main Steps for Data Annotation
- Developing Gold Standard Annotated Corpus using Data Sources with Annotations
- Developing Gold Standard Annotated Corpus using Data Sources without Annotations
- Chapter Summary
Quick Recap
- Quick Recap – Data and Annotations
- A Real-world Problem is defined as a matter or situation regarded as unwelcome or harmful and needing to be dealt with and overcome
- To systematically identify the most suitable Solution to a Real-world Problem, use the following Step by Step approach
- Step 1: Write down the Real-world Problem and your current circumstances
- Step 2: Completely and correctly understand the situation
- Step 3: List down the Possible Solutions that you know
- Step 4: Consult Domain Experts and people who faced a similar Real-world Problem in the past, and update your List of Possible Solutions
- Step 5: Write down strengths and weaknesses of each Possible Solution
- Step 6: Shortlist and rank the 3 Possible Solutions that seem most suitable
- Step 7: Consult a Domain Expert and select the one which seems to be most suitable to solve the Real-world Problem in your current situation
- Step 8: Apply the selected Solution to solve your Real-world Problem
- If (Real-world Problem can be broken into Input and Output)
- Then
- You can treat that Real-world Problem as a Machine Learning Problem 😊
- Data = Model + Error
- This shows that Data is backbone of Machine Learning
- This is why the Machine Learning Approach is also called the Data-Driven Approach
- To develop Intelligent Programs (or Models) using Machine Learning Approach (or Data Driven Approach), we need
- Large amount of Data
- High-quality Data
- Balanced Data
- Corpus / Dataset is defined as a collection of Real-world Data in Machine-readable Format
- Machine-readable Corpus is
- a corpus which a Machine can read
- Machine Understandable Corpus is
- a corpus which a Learner (Machine Learning Algorithm) can use to learn
- The common practice used to Train and Test Machine Learning Algorithms using a Corpus is as follows
- Step 1: Gold Standard (or benchmark) corpus is developed in Machine-readable Format
- Step 2: Machine-readable Corpus is transformed into Machine Understandable Corpus
- Step 3: Machine Learning Algorithms are Trained and Tested on Machine Understandable Corpus
- The Corpus / Dataset used in Machine Learning can be mainly categorized as
- Annotated Corpus
- Output is associated with all Inputs
- Used for Supervised Learning
- Unannotated Corpus
- Output is not associated with Inputs
- Used for Unsupervised Learning
- Semi-annotated Corpus
- Output is associated with some Inputs
- Used for Semi-supervised Learning
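As a toy illustration (the sentiment instances and labels below are invented), the three corpus types can be represented in Python as follows:

```python
# Toy sentiment instances (invented for illustration).

# Annotated Corpus: Output is associated with all Inputs -> Supervised Learning
annotated = [
    ("I love this phone", "Positive"),
    ("Battery life is terrible", "Negative"),
]

# Unannotated Corpus: Output is not associated with Inputs -> Unsupervised Learning
unannotated = [
    "I love this phone",
    "Battery life is terrible",
]

# Semi-annotated Corpus: Output is associated with some Inputs -> Semi-supervised Learning
semi_annotated = [
    ("I love this phone", "Positive"),
    ("Battery life is terrible", None),  # Output missing for this Input
]

# Count how many instances in the semi-annotated corpus carry an Output
labelled = sum(1 for _, output in semi_annotated if output is not None)
print(labelled)  # → 1
```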
- An annotated / unannotated / semi-annotated corpus can be
- Mono-lingual Corpus
- Text comprises only one language
- Multi-lingual Corpus
- Text comprises more than one language
- Cross-lingual Corpus
- Comparable Corpus
- A Comparable Corpus is a collection of similar texts in different languages or in different varieties of a language
- Parallel Corpus
- Parallel Corpus is a corpus that contains a collection of original texts in language L1 and their translations into a set of languages L2 … Ln
- Comparable Corpus
- In a Comparable Corpus, the topic of the texts / articles will be the same; however, their content will not be the same
- In most cases, Parallel corpora contain data from only two languages
- A Parallel Corpus can be
- Bi-lingual Parallel Corpus
- A Bi-lingual Parallel Corpus consists of texts of two languages
- Multi-lingual Parallel Corpus
- A Multi-lingual Parallel Corpus consists of texts of more than two languages
- Uni-directional Parallel Corpus
- A Uni-directional Parallel Corpus contains translation in only one direction
- e.g., Arabic text translated into Urdu
- Bi-directional Parallel Corpus
- A Bi-directional Parallel Corpus contains translations in both directions
- e.g., Arabic text translated into Urdu and vice versa
- Multi-directional Parallel Corpus
- A Multi-directional Parallel Corpus contains translations in multiple languages
- e.g., Arabic text (Quran-e-Pak) translated into Urdu, English, Persian, German, French, etc.
- For more accurate learning, we need to have
- Annotated Data
- Data Annotation mainly
- adds value to a corpus in that it considerably extends the range of research questions that a corpus can readily address
- The main issues in developing a Gold Standard (or benchmark) Corpus / Dataset to Train / Test Machine Learning Algorithms are
- Data Sampling
- Time Span
- Size of Corpus
- Source(s) of Data
- High-quality Data
- Balanced Data
- Data Collection
- Annotation Guidelines
- Annotators
- Data Protection
- Corpus Standardization
- Population (N)
- Total set of observations (or examples) for a Machine Learning Problem
- Collecting data equivalent to the size of the Population will lead to perfect learning
- Sample (S)
- Subset of observations (or examples) drawn from a Population
- The size of a Sample is always less than the size of the Population from which it is taken
- Most Important Property of a Sample
- A Sample should be true representative of the Population
- A Representative Sample is a subset of a Population that seeks to accurately reflect the characteristics of the Population
- To generate Annotated Corpus of high-quality
- Sample Data must be true representative of the Population
- Language changes with time
- Consequently, Data changes with time
- Determination of a particular Time Span is required to
- capture the features of a language (Data) within that time span
- Remember
- Data = Model + Error
- Thus, when Data changes then
- Model and Error also change
- A Machine Learning Algorithm can be better Trained and Tested if we have large Annotated Corpus
- A corpus should be large because
- wide representation is possible within a larger corpus
- A large corpus should also have the
- scope for regular augmentation
- Data Source(s) used for creating Gold Standard Annotated Corpus should be authentic, appropriate and preferably in Digital Format.
- Two main types of Data Sources are
- Sources with Annotations
- In this type of Data Sources, the Output is present in the Data along with the Input
- Sources without Annotations
- In this type of Data Sources, the Output is not present in the Data along with the Input
- Sources with / without Annotations can be
- Online Digital Repositories
- Non-digital Repositories
- Existing Corpora
- Skill set required for annotations varies from ML Problem to ML Problem
- Completely and correctly understand the annotations before selecting Annotators to annotate Data
- In Machine Learning
- Low Quality Data = Poor Model + High Error
- High Quality Data = Good Model + Low Error
- Sample Data should be complete, correct, diversified and a true representative of the Population
- To have an unbiased and good Model, the Training Data should be balanced
- Data collection is the systematic approach to gathering Data from authentic Data Source(s)
- Accurate Data Collection is essential to build a high-quality Annotated Corpus
- Three main approaches to Data Collection are
- Manual Approach
- only involves Human
- Semi-automatic Approach
- involves both Machine and Human
- Automatic Approach
- only involves Machine
- Data in Digital Format is mainly collected using
- Semi-automatic Approaches
- Automatic Approaches
- A Data Source may contain
- Data that can be accessed freely and publicly
- Data that can be viewed freely and publicly
- Data that is not publicly available
- Annotation Guidelines are the set of rules / guidelines that help an Annotator assign the most appropriate Output to an Input
- Annotation Guidelines are the backbone of Annotation Process
- Low-quality or inappropriate Annotation Guidelines will result in a low-quality Annotated Corpus and vice versa
- Two Main Situations in Data Annotations
- Annotations are Present in Raw Data
- Don’t need Annotators / Taggers
- Annotations are Not Present in Raw Data
- Need Annotators / Taggers to manually annotate Raw Data
- Standard Practice for Annotations
- Use at least 3 Annotators to annotate Data according to Annotation Guidelines
- All Annotators must be Domain Experts
- Data Protection is the process of safeguarding important information from corruption, compromise or loss
- Data used for developing a Gold Standard Annotated Corpus can be
- Public Data
- To use this type of Data, we don’t need to take explicit permission from the Owner of Data
- Private Data
- To use this type of data, we need to take explicit permission from the Owner of Data to use his / her Data for research purposes
- Both
- A Corpus License means the terms and conditions for use, reproduction, and distribution of the corpus
- For Machine Learning, a corpus should be standardized in
- Machine-readable Format
- Three Main Formats
- Plain Text Format
- CSV Format
- XML Format
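As a minimal sketch of the CSV Format (the column names are illustrative, not prescribed by the chapter), a small annotated corpus can be standardized with Python's built-in csv module:

```python
import csv
import io

# Illustrative (Input, Output) pairs of an annotated corpus
corpus = [
    ("Full text of news article 1 ...", "Headline 1"),
    ("Full text of news article 2 ...", "Headline 2"),
]

# Write the corpus in CSV Format (an in-memory buffer stands in for a file)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["input", "output"])  # header row
writer.writerows(corpus)

# Read it back, as a Machine would
buf.seek(0)
rows = list(csv.reader(buf))
print(rows[0])        # → ['input', 'output']
print(len(rows) - 1)  # → 2 (number of instances)
```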
- Main characteristics of a Gold Standard Annotated Corpus are as follows
- Sample Data should be true representative of Population
- A Representative Sample should have
- Large Amount of Data
- High-quality Data
- Balanced Data
- Data Sources should be
- Appropriate and
- Authentic
- For Data Collection, all legal and ethical requirements should be fulfilled
- Standard Tools / Techniques should be used for Data Collection
- Corpus Generation Process must be standard and state-of-the-art
- There should be at least three Annotators
- Annotators must be Domain Experts
- Annotation Guidelines should clearly describe the annotation criteria
- Inter-Annotator Agreement and Kappa Statistics must be good
- Corpus must be standardized properly using Standard Formats
- License of Corpus should be clearly mentioned
- Corpus Characteristics should be clearly mentioned
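The Inter-Annotator Agreement mentioned above is commonly quantified with Cohen's kappa for a pair of annotators. A minimal from-scratch sketch (the toy labels below are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same instances."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy annotations by two annotators over six instances
ann_1 = ["Pos", "Pos", "Neg", "Neg", "Pos", "Neg"]
ann_2 = ["Pos", "Pos", "Neg", "Pos", "Pos", "Neg"]
print(round(cohens_kappa(ann_1, ann_2), 3))  # → 0.667
```

A kappa near 1 indicates strong agreement beyond chance; values near 0 suggest the Annotation Guidelines need revision.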
Main Steps for Data Annotation
- Main Steps for Data Annotation
- Step 1: Completely and correctly understand the Real-world Problem
- Step 2: Check if the Real-world Problem can be treated as a Machine Learning Problem?
- If Yes
- Go to Next Step
- Step 3: Write down Possible Solution(s) to the Annotated Corpus Development Issues discussed in Lecture 3 – Data and Annotations
- Note that Possible Solution(s) to each Annotated Corpus Development Issue should be well justified
- Step 4: Develop proposed Annotated Corpus at the Prototype Level
- Record the problems that you faced in developing Annotated Corpus at prototype level
- Write down Possible Solution(s) to handle problems encountered in developing Annotated Corpus at prototype level
- Step 5: Develop proposed Annotated Corpus at full scale
- Steps – Annotated Corpus Creation Process
- When you create your proposed Annotated Corpus at prototype and / or full-scale level, follow these main steps
- Step 1: Raw Data Collection
- Step 1.1: Data Cleaning (if needed)
- Step 1.2: Data Pre-processing (if needed)
- Step 2: Annotation Process
- Step 2.1: Annotation Guidelines
- Step 2.2: Annotations
- Step 2.3: Inter-Annotator Agreement (IAA)
- Step 3: Corpus Characteristics and Standardization
Developing Gold Standard Annotated Corpus using Data Sources with Annotations
- Types of Data Sources
- An authentic and appropriate Data Source is essential to develop a large Gold Standard Annotated Corpus
- As discussed in Lecture 3 – Data and Annotations, the main types of Data Sources are
- Sources with Annotations
- Sources without Annotations
- Sources with / without Annotations can be
- Online Digital Repositories
- Non-digital Repositories
- Existing Corpora
- Three Main Types of Machine Learning Problems
- The Machine Learning Problems can be broadly categorized into three main types
- Classification Problems
- Regression Problems
- Sequence to Sequence Problems
- Note
- For each type of Machine Learning Problem
- Suitable Machine Learning Algorithms may differ
- Classification Problems – Input and Output
- Input
- Structured / Unstructured / Semi-structured
- Output
- Categorical
- Example 1 - Classification Problems
- Machine Learning Problem
- Gender Identification from Audio
- Input
- Audio
- Output
- Male / Female
- Note that it is a Binary Classification Problem
- Example 2 - Classification Problems
- Machine Learning Problem
- Human Detection from Video
- Input
- Video
- Output
- Human / Non-Human
- Note that it is a Binary Classification Problem
- Example 3 - Classification Problems
- Machine Learning Problem
- Sentiment Detection from Text
- Input
- A Text
- Output
- Positive / Negative / Neutral
- Note that it is a Multi-class Classification Problem
- Example 4 - Classification Problems
- Machine Learning Problem
- Toxic Comment Detection
- Input
- A Text
- Output
- Toxic / Severe-Toxic / Obscene / Threat / Insult / Non-Toxic
- Note that it is a Multi-class Classification Problem
- You can transform the above Multi-class Classification Problem into a Binary Classification Problem
- How?
- Combine the 5 classes (Toxic, Severe-Toxic, Obscene, Threat and Insult) into a Single Class, i.e., Toxic
- Toxic Comment Detection Treated as a Binary Classification Problem
- Input
- A Text
- Output
- Toxic / Non-Toxic
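The class merging described above can be sketched as a simple label mapping (the function name is illustrative):

```python
# The 5 classes to be collapsed into the single "Toxic" class
TOXIC_CLASSES = {"Toxic", "Severe-Toxic", "Obscene", "Threat", "Insult"}

def to_binary(label):
    """Collapse the 6 multi-class labels into Toxic / Non-Toxic."""
    return "Toxic" if label in TOXIC_CLASSES else "Non-Toxic"

# Toy multi-class labels, re-annotated for the binary problem
multi = ["Threat", "Non-Toxic", "Obscene", "Insult", "Non-Toxic"]
binary = [to_binary(y) for y in multi]
print(binary)  # → ['Toxic', 'Non-Toxic', 'Toxic', 'Toxic', 'Non-Toxic']
```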
- Lesson Learned
- A Machine Learning Problem may be treated both as
- Multi-class Classification Problem and
- Binary Classification Problem
- Example 5 - Classification Problems
- Machine Learning Problem
- Detecting Fake Video
- Input
- A Video
- Output
- Fake / Real
- Regression Problems – Input and Output
- Input
- Structured / Unstructured / Semi-structured
- Output
- Numeric
- Example 1 - Regression Problems
- Machine Learning Problem
- GPA Prediction (1st Semester) for University Students
- Input
- Structured
- A Fixed Set of 2 Attributes
- Matric Marks
- FSc Marks
- Output
- GPA (1st Semester)
- Note
- In Input, I have used only two Attributes (for simplicity)
- We may use other Attributes as well
- Example 2 - Regression Problems
- Machine Learning Problem
- House Price Prediction
- Input
- Structured
- A Fixed Set of 8 Attributes
- Neighborhood, Building Type, House Style, Overall Quality, Overall Condition, Year Built, Air Conditioning, Garage Type
- Output
- Sale Price
- Note
- In Input, I have used only 8 Attributes (for simplicity)
- We may use other Attributes as well
- Example 3 - Regression Problems
- Machine Learning Problem
- Predicting Sales of a Product
- Input
- A Single Attribute
- Previous Sales per Day
- Output
- Future Sales
- Example 4 - Regression Problems
- Machine Learning Problem
- Predicting Admission in University
- Input
- A Single Attribute
- Previous Admission per Semester
- Output
- Future Admissions
- Example 5 - Regression Problems
- Machine Learning Problem
- Electricity Load Prediction
- Input
- A Single Attribute
- Electricity Consumption per Hour
- Output
- Future Electricity Consumption
- Sequence to Sequence Problems – Input and Output
- Input
- Unstructured (of variable length)
- Output
- Unstructured (of variable length)
- Example 1 - Sequence to Sequence Problems
- Machine Learning Problem
- Text Summarization
- Input
- A Source Text
- Output
- Summary
- Note that the length of Input is quite large compared to the length of Output
- Example 2 - Sequence to Sequence Problems
- Machine Learning Problem
- Machine Translation
- Input
- Source Text in One Language (Arabic)
- Output
- Translation of Source Text in Target Language (Urdu)
- Note that the length of the Input is almost the same as the length of the Output
- Example 3 - Sequence to Sequence Problems
- Machine Learning Problem
- Automatic Paraphrase Generation
- Input
- A Text
- Output
- Paraphrase (of Input Text)
- Note that the length of the Input is almost the same as the length of the Output
- Example – Automatic Paraphrase Generation
- Input
- Allah pak Aajzi pr milty hain (Allah is found through humility)
- Output
- Aajzi krny waly Allah ko pa jaty hain (Those who show humility find Allah)
- Example 4 - Sequence to Sequence Problems
- Machine Learning Problem
- Natural Language Description Generation from Image
- Input
- Image
- Output
- Textual Description of Image
- Note that the Input is image and Output is Text
- Example 5 - Sequence to Sequence Problems
- Machine Learning Problem
- Speech to Text
- Input
- Audio
- Output
- Text
- Note that the Input is audio and Output is Text
- Note
- In Sha Allah, in the rest of this Lecture, I will try to present simple, detailed and step-by-step examples on
- How to create Gold Standard Annotated Corpus using Data Source with / without Annotations
- Note that for simplicity and understanding, we will create
- Gold Standard Annotated Corpus at Prototype Level 😊
TODO and Your Turn
Todo Tasks
TODO Task 1
- Task
- Consider the following Real-world Problems and answer the questions given below
- Emotion Detection
- Classes / Categories = Sad, Happy, Disgust, Love, Neutral
- Sentiment Analysis
- Classes / Categories = Positive, Negative, Neutral
- Text Reuse Detection
- Classes / Categories = Wholly Derived, Partially Derived, Non-Derived
- Plagiarism Detection
- Classes / Categories = Near Copy, Lightly Paraphrased, Heavily Paraphrased, Non-Plagiarized
- Note
- Your answers should be
- Well Justified
- Questions
- Which of these Real-world Problems can be treated as a Binary Classification Problem?
- Which of these Real-world Problems can be treated as a Multi-class Classification Problem?
- If any of these Real-world Problems is a Multi-class Classification Problem, then
- Explain how you will convert it to a Binary Classification Problem
- Write the Input and Output for the Multi-class Classification Problem
- Write the Input and Output for the Binary Classification Problem
Your Turn Tasks
Your Turn Task 1
- Task
- Identify three Real-world Problems which can be treated both as a Multi-class Classification Problem and a Binary Classification Problem
- Questions
- Explain how you will convert the Multi-class Classification Problem to a Binary Classification Problem
- Write the Input and Output for the Multi-class Classification Problem
- Write the Input and Output for the Binary Classification Problem
Example 1 – Developing Gold Standard Annotated Corpus using Data Sources with Annotations – A Step by Step Example
- Main Steps for Data Annotation
- Step 1: Completely and correctly understand the Real-world Problem
- Step 2: Check if the Real-world Problem can be treated as a Machine Learning Problem?
- If Yes
- Go to Next Step
- Step 3: Write down Possible Solution(s) to the Annotated Corpus Development Issues discussed in Lecture 3 – Data and Annotations
- Note that Possible Solution(s) to each Annotated Corpus Development Issue should be well justified
- Step 4: Develop proposed Annotated Corpus at the Prototype Level
- Record the problems that you faced in developing Annotated Corpus at prototype level
- Write down Possible Solution(s) to handle problems encountered in developing Annotated Corpus at prototype level
- Step 5: Develop proposed Annotated Corpus at full scale
- Step 1: Completely and correctly understand the Real-world Problem
Understanding Real-world Problem
- Real-world Problem
- In Journalism, hundreds of thousands of News Articles are published online on a daily basis. Manually extracting useful information from such huge volumes of data is a non-trivial task.
- A Possible Solution
- Develop an automatic summarization system, which accurately generates a short summary (or headline) of a News Article
- Step 2: Can We Treat the Real-world Problem as a Machine Learning Problem?
- Recall Lecture 2 – Basics of Machine Learning
- Majority of Machine Learning involves
- Learning Input-Output Functions
- Conclusion
- If you can break a Real-world Problem into
- Input and Output then
- You can treat that Real-world Problem as a Machine Learning Problem 😊
- Goal
- Develop an automatic text summarization system
- Input
- An Urdu News Article
- Output
- A Short Summary (or Headline)
- Conclusion
- Yes, we can treat Text Summarization Problem (Real-world problem) as a Machine Learning Problem 😊
- Note
- Text Summarization Problem is a Sequence to Sequence Problem because
- Input is unstructured and of variable length
- Output is unstructured and of variable length
- Note that length of Input is quite large compared to the length of Output
- Step 3: Possible Solutions to Annotated Corpus Development Issues
- Research Focus and Annotated Corpus Development Issues
- Research Focus plays a key role in Annotated Corpus Development Process
- Example - Research Focus
- Research Focus 1
- Develop an automatic mono-lingual text summarization system for English news articles
- Remarks / Comments
- A huge amount of English news articles (data) are easily and readily available through online digital repositories
- Consequently, it will be easy to collect large Sample Data for developing Gold Standard Annotated Corpus
- Research Focus 2
- Develop an automatic mono-lingual text summarization system for Urdu news articles
- Remarks / Comments
- A large amount of Urdu news articles (data) are easily and readily available through online digital repositories
- Consequently, it will be easy to collect sufficiently large Sample Data for developing Gold Standard Annotated Corpus
- Research Focus 3
- Develop an automatic mono-lingual text summarization system for Punjabi news articles
- Remarks / Comments
- A large amount of Punjabi news articles (data) are not easily and readily available through online digital repositories
- Consequently, it will be very difficult to collect sufficiently large Sample Data for developing Gold Standard Annotated Corpus
- Tip
- When planning to develop a Gold Standard Annotated Corpus
- Step 1: Write down your Research Focus in a single sentence
- Step 2: Search on the internet for potential Data Source(s) and discuss with your Supervisor (or seniors)
- Step 3:
If (You and your Supervisor are satisfied that you will be able to collect sufficiently large Sample Data)
Then Go ahead and start developing Gold Standard Annotated Corpus
Else You will need to change your Research Focus ☹
- Remember
- To be successful in life, only take
- Calculated Risks 😊
- Research Focus for Our Current Example
- Research Focus
- Develop an automatic mono-lingual text summarization system for Urdu language
- Issue 1 – Data Sampling
- Population
- All news articles published in Urdu newspapers
- Main Characteristics of Urdu News Articles
- Countries
- Urdu news articles are published in many countries of the world, e.g. Pakistan, India, England, USA etc.
- Domains
- Politics
- Sports
- War
- Religion
- Entertainment
- Crime and Court
- Economics
- Science and Technology
- Lifestyle
- Other
- Length
- Small (<= 1 page)
- Medium (2 – 3 pages)
- Large (> 3 pages)
- Popular Urdu Newspaper Websites
- Representative Sample
- To draw a Representative Sample, I would collect Sample Data with following characteristics
- Characteristics
- Country
- Pakistan
- Domains
- Politics
- Sports
- War
- Religion
- Entertainment
- Crime and Court
- Economics
- Science and Technology
- Lifestyle
- Other
- Length of Urdu News Article
- Preferably Short (may also include Medium size Urdu news articles)
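Drawing such a Representative Sample across the listed domains amounts to stratified sampling. A minimal sketch, assuming the articles are already grouped by domain (the toy data below is invented):

```python
import random

def stratified_sample(articles_by_domain, per_domain, seed=0):
    """Draw the same number of articles from every domain."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = []
    for domain, articles in articles_by_domain.items():
        # Guard against domains with fewer articles than requested
        sample.extend(rng.sample(articles, min(per_domain, len(articles))))
    return sample

# Toy data: two domains with dummy article ids
articles_by_domain = {
    "Politics": [f"pol-{i}" for i in range(50)],
    "Sports": [f"spo-{i}" for i in range(50)],
}
sample = stratified_sample(articles_by_domain, per_domain=10)
print(len(sample))  # → 20
```

Scaled up, the same idea yields the target of 100K articles per domain.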
- Issue 2 – Time Span
- I would collect online Urdu news stories published in last five years
- i.e., From 1-1-2017 To 31-12-2021
- Assumption
- The Urdu language used in news articles in last five years is almost same as present day
- Issue 3 – Size of Corpus
- I target that my proposed corpus should contain
- 1 Million news articles
- TIP
- To hit the target of 1 Million, try to collect slightly more than 1 Million news articles
- Remember
- To hit the target, always aim a bit higher 😊 Good Luck
- Issue 4 – Potential Data Sources
- Potential Data Sources
- I aim to collect Sample Data from the following online digital repositories
- https://www.express.pk/
- https://jang.com.pk/
- https://www.nawaiwaqt.com.pk/
- https://www.bbc.com/urdu
- Reasons for Selecting these Data Sources
- The above-mentioned Data Sources are
- Authentic
- Appropriate to create Urdu Text Summarization Corpus
- Free and publicly available for research purposes
- Contain large amount of Urdu news articles in Digital Format
- Contain Urdu news articles in various domains
- Majority of Urdu news articles are Small to Medium in size
- Potential Challenges in Collecting Data from Data Sources
- Problem 1
- Manual Data Collection of 1 Million Urdu news articles is not practically possible
- A Possible Solution
- Develop a Web Crawler (an automatic program) to extract Urdu news articles from online Data Sources
- Issue 5 – High-quality Data
- I collected Sample Data from online Data Sources using a Web Crawler
- Problem 1
- When Sample Data is extracted through the Web Crawler, it will contain noise
- A Possible Solution
- Write a program to automatically clean Sample Data
- Problem 2
- Text in Urdu news articles will not be properly tokenized
- A Possible Solution
- Use an accurate Urdu Word Tokenizer to properly tokenize the text in Urdu news articles
- Note
- To improve quality of Sample Textual Data, I properly
- cleaned and tokenized it
- Issue 6 – Balanced Data
- There are total 10 domains (see Issue 1 – Data Sampling)
- In Sha Allah, I aim to collect 100K Urdu news articles for each domain
- Corpus Size = 10 x 100K
- Corpus Size = 1 Million
- Issue 7 – Data Collection
- Sample Data that I need to develop my proposed text summarization corpus is in
- Digital Format
- I will use a Web Crawler to automatically extract (or collect) Sample Data from online Data Sources
- Issue 8 – Annotation Guidelines
- In my targeted Data Sources, Output (Headline) is provided with the Input (Urdu News Article)
- So, I do not need to design Annotation Guidelines
- Issue 9 – Annotators
- In my targeted Data Sources, Output (Headline) is provided with the Input (Urdu News Article)
- So, I do not need Annotators
- Issue 10 - Data Protection
- Since I am using free and publicly available data for research purposes
- Therefore, there are no Data Privacy issues
- Insha Allah, I will also make my proposed corpus
- Free and publicly available for research purposes 😊
- Remember
- To become a great personality
- Serve the Humanity for the Sake of Allah (مخلوق خدا کی بےلوث خدمت)
- To become a great personality
- Issue 11: Corpus Standardization
- In Sha Allah, I will standardize my proposed corpus in the following format
- CSV Format
- Step 4: Developing Proposed Annotated Corpus at Prototype Level
- Developing Proposed Annotated Corpus at Prototype Level
- The biggest advantage of doing a task at prototype level is that
- Through manual inspection, you can verify whether your automatic approach (or a process) is working correctly or not?
- Remember
- If you cannot successfully execute a task at prototype level
- You can never execute it at Real-world level
- Note
- Insha Allah, I will develop a small (at prototype level) Gold Standard Annotated Corpus of 15 instances
- I will also list down the problems (along with their Possible Solutions) that I encountered in developing Gold Standard Annotated Corpus at prototype level
- Steps – Annotated Corpus Creation Process
- When you create your proposed annotated corpus at prototype and / or full-scale level, follow these main steps
- Step 1: Raw Data Collection
- Step 1.1: Data Cleaning (if needed)
- Step 1.2: Data Pre-processing (if needed)
- Step 2: Annotation Process
- Step 2.1: Annotation Guidelines
- Step 2.2: Annotations
- Step 2.3: Inter-Annotator Agreement (IAA)
- Step 3: Corpus Characteristics and Standardization
- Step 1: Raw Data Collection
- Insha Allah, I will follow the following steps to collect Raw Data
- Step 1: Write a program in Python programming language (called urls-list.py) to make a List of URLs containing Urdu news articles
- Step 2: Use urls-list.py to extract URLs from online Data Sources (websites)
- I will make separate List of URLs for each online Data Source
- bbc-urls-list.txt
- express-urls-list.txt
- jang-urls-list.txt
- nawaiwaqt-urls-list.txt
- Step 3: Write a program in Python programming language (called web-crawler.py) to extract Urdu news articles text using List of URLs generated in Step 2
- Step 4: Write a program in Python programming language (called clean-data.py) to clean data extracted in Step 3
- Step 5: Write a program in Python programming language (called word-tokenizer.py) to properly tokenize the data cleaned in Step 4
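Steps 4 and 5 (clean-data.py and word-tokenizer.py) can be sketched as two small functions. The cleaning rules and the naive tokenizer below are illustrative simplifications; a real Urdu word tokenizer must handle space-omission and space-insertion errors, which this regex does not:

```python
import re

def clean_text(text):
    """Remove leftover HTML tags and collapse extra whitespace (illustrative rules)."""
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags left by the crawler
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

def tokenize(text):
    """Naive word tokenizer (a stand-in for a proper Urdu tokenizer)."""
    return re.findall(r"\w+", text)

raw = "<p>Breaking   news:\nmarkets  rise</p>"
cleaned = clean_text(raw)
tokens = tokenize(cleaned)
print(cleaned)  # → Breaking news: markets rise
print(tokens)   # → ['Breaking', 'news', 'markets', 'rise']
```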
- Step 2: Annotation Process
- Note
- For the Annotation Process, I will use the cleaned and pre-processed (properly tokenized) data
- Write a program in Python programming language (called annotation.py) to separate Input (Urdu News Article) from Output (Headline)
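Since the Output (Headline) is already present in the Raw Data, annotation.py only needs to split each record into (Input, Output). A minimal sketch, assuming (purely for illustration) that each crawled record stores the headline on its first line, followed by the article body:

```python
def separate(record):
    """Split a crawled record into (article_body, headline).

    Assumes, for illustration, that the headline is the first line
    of the record and the article body follows it.
    """
    headline, _, body = record.partition("\n")
    return body.strip(), headline.strip()

record = "Floods hit the north\nHeavy rains over the weekend caused ..."
article, headline = separate(record)
print(headline)  # → Floods hit the north
```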
- Step 3: Corpus Characteristics and Standardization
- Insha Allah, I will summarize the main characteristics of my Gold Standard Annotated Corpus in a table
- Main Characteristics of Gold Standard Annotated Corpus
- Total number of documents in corpus
- Total number of words in corpus
- Total number of unique words in corpus
- Total number of words in Urdu News Articles
- Total number of words in Headlines (or summaries)
- Total number of unique words in Urdu News Articles
- Total number of unique words in Headlines (or summaries)
- Average length of Urdu News Article
- Average length of Headline (or summary)
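The characteristics above can be computed directly from the (Input, Output) pairs. A minimal sketch over a toy two-document corpus (the texts are invented; word counts here use simple whitespace splitting):

```python
def corpus_stats(pairs):
    """Compute the main characteristics for a list of (article, headline) pairs."""
    art_words = [w for article, _ in pairs for w in article.split()]
    head_words = [w for _, headline in pairs for w in headline.split()]
    return {
        "documents": len(pairs),
        "total_words": len(art_words) + len(head_words),
        "unique_words": len(set(art_words) | set(head_words)),
        "avg_article_len": len(art_words) / len(pairs),
        "avg_headline_len": len(head_words) / len(pairs),
    }

pairs = [
    ("rains flood the north valley", "floods hit north"),
    ("markets rise after the rains", "markets rise"),
]
stats = corpus_stats(pairs)
print(stats["documents"], stats["avg_article_len"])  # → 2 5.0
```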
- Corpus Standardization
- I standardized my Gold Standard Annotated Corpus in
- CSV Format
- See summarization-sample-data.xlsx file in Supporting Material
- Step 5: Developing Proposed Annotated Corpus at Full Scale Level
- In this step, I will develop the full scale corpus of 1 Million instances using
- The same steps I followed in Step 4: Developing Proposed Annotated Corpus at Prototype Level
TODO and Your Turn
Todo Tasks
TODO Task 2
- Task
- Ghufran wants to develop a Gold Standard Annotated Corpus for the following Real-world Problem
- Real-world Problem
- Automatic Title Generation for Urdu Columns
- Note
- Your answer should be
- Well Justified
- Questions
- Write the Input and Output for the above Real-world Problem
- What type of Data Sources will be available for creating a corpus for the mentioned Real-world Problem?
- Sources with Annotations
- Sources without Annotations
- Select 10 Urdu Columns from the following Website
- Apply the Main Steps for Data Annotation to create your Gold Standard Annotated Corpus for the task: Automatic Title Generation for Urdu Columns
Your Turn Tasks
Your Turn Task 2
- Task
- Select a Real-world Problem and write it in one sentence (similar to: Automatic Title Generation for Urdu Columns). Your task is to create a Gold Standard Annotated Corpus for your Real-world Problem. However, you must select Data Sources with Annotations to collect Sample Data for creating Gold Standard Annotated Corpus.
- Note
- Your answer should be
- Well Justified
- Questions
- Write the Input and Output for your selected Real-world Problem.
- Apply the Main Steps for Data Annotation to create your Gold Standard Annotated Corpus for the selected Real-world Problem.
Example 2 – Developing Gold Standard Annotated Corpus using Data Sources with Annotations – A Step by Step Example
- Main Steps for Data Annotation
- Step 1: Completely and correctly understand the Real-world Problem
- Step 2: Check if the Real-world Problem can be treated as a Machine Learning Problem?
- If Yes
- Go to Next Step
- Step 3: Write down Possible Solution(s) to the Annotated Corpus Development Issues discussed in Lecture 3 – Data and Annotations
- Note that Possible Solution(s) to each Annotated Corpus Development Issue should be well justified
- Step 4: Develop proposed Annotated Corpus at the Prototype Level
- Record the problems that you faced in developing Annotated Corpus at prototype level
- Write down Possible Solution(s) to handle the problems encountered in developing the Annotated Corpus at prototype level
- Step 5: Develop proposed Annotated Corpus at full scale
- Step 1: Completely and correctly understand the Real-world Problem
- Real-world Problem
- In university, 30% of the students drop out in the First Year. This number is alarmingly high. How can we use technology to alert students, well before time (e.g., on their first day at university), that they are at risk of dropping out?
- A Possible Solution
- Develop an Intelligent Program
- which accurately predicts the GPA (1st semester) of a university student and
- alerts him / her at the start of the 1st semester that (s)he will drop out at the end of the 1st semester
- Step 2: Can We Treat Real-world Problem as a Machine Learning Problem?
- Recall Lecture 2 – Basics of Machine Learning
- Majority of Machine Learning involves
- Learning Input-Output Functions
- Conclusion
- If you can break a Real-world Problem into
- Input and Output then
- You can treat that Real-world Problem as a Machine Learning Problem 😊
- Goal
- Develop an automatic system to predict a student's GPA in the 1st semester using his / her marks in Matric and F.Sc.
- Input
- Marks in Matric
- Marks in F.Sc.
- Output
- GPA in 1st semester
- Conclusion
- Yes, we can treat the GPA Prediction Problem (Real-world Problem) as a Machine Learning Problem 😊
- Note
- GPA Prediction Problem is a Regression Problem because
- Output is Numeric
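Because the Output (GPA) is numeric, the model to be trained is a Regression model. The sketch below fits ordinary least squares to a single combined-marks feature; the four records and the feature choice are illustrative assumptions, not real student data:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical records: (Matric marks, F.Sc. marks, GPA in 1st semester).
records = [(900, 880, 3.4), (700, 650, 2.4), (1000, 960, 3.8), (600, 580, 2.0)]
xs = [m + f for m, f, _ in records]   # single combined-marks feature
ys = [gpa for _, _, gpa in records]

a, b = fit_line(xs, ys)
predicted_gpa = a + b * (850 + 820)   # prediction for a new student
```

A real system would use both marks as separate features and far more data; this only illustrates the Input-Output function being learned.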
- Step 3: Possible Solutions to Annotated Corpus Development Issues
- Research Focus and Annotated Corpus Development Issues
- Research Focus plays a key role in Annotated Corpus Development Process
- Example – Research Focus
- Research Focus 1
- Develop a GPA Prediction system for university students in the world studying in various disciplines
- Remarks / Comments
- It will be practically impossible to collect large and high-quality data of university students in the world for all the disciplines
- Also, the amount of time and effort required to collect data will be more than the amount of work required for an MS / PhD thesis
- Conclusion
- To conclude, the Research Focus is very broad, we need to narrow it down
- Research Focus 2
- Develop a GPA Prediction system for university students in Pakistan studying in various disciplines
- Remarks / Comments
- It will be very challenging to collect large and high-quality data of university students in Pakistan for all the disciplines
- Also, the amount of time and effort required to collect data will be more than the amount of work required for an MS / PhD thesis
- Conclusion
- To conclude, the Research Focus is broad, we need to narrow it down
- Research Focus 3
- Develop a GPA Prediction system for university students in Pakistan studying Computer Science at Undergrad level in three degree programs: BS(CS), BS(SE) and BS(IT)
- Remarks / Comments
- It will be possible to collect large and high-quality data of university students in Pakistan studying Computer Science at Undergrad level in three degree programs: BS(CS), BS(SE) and BS(IT)
- Also, the amount of time and effort required to collect data will be commensurate with the amount of work required for an MS / PhD thesis
- Conclusion
- To conclude, the Research Focus is fine 😊
- Note
- Before submitting a FYP / MS / PhD thesis, answer the following question
- Question
- Is this work enough for FYP / MS / PhD thesis?
- Answer
(Prof. Robert Gaizauskas, University of Sheffield, UK)
- Tip
- When planning to develop an Annotated Corpus
- Step 1: Write down your Research Focus in a single sentence:
- Step 2: Search on the internet for potential Data Source(s) and discuss with your Supervisor (or seniors)
- Step 3:
- Remember
- To be successful in life, only take
- Calculated Risks 😊
- Research Focus for Our Current Example
- Research Focus
- Develop a GPA Prediction system for university students in Pakistan studying Computer Science at Undergrad level in three degree programs: BS(CS), BS(SE) and BS(IT)
- Issue 1 – Data Sampling
- Population
- All university students in Pakistan
- Studying in 2nd semester or
- Dropped out in 2nd – 8th semester or
- Passed out in BS(CS), BS(SE) or BS(IT) degree program
- Main Characteristics of Universities in Pakistan
- Universities
- Total = 174
- Government Sector = 137 (78.74%)
- Private = 37 (21.26%)
- Main Computer Science Degree Programs
- BS(CS)
- BS(IT)
- BS(SE)
- MCS
- MIS
- MS(CS)
- PhD (CS)
- Duration of Degree Programs
- BS Programs – 4 years (8 semesters)
- Masters Programs – 2 years (4 semesters)
- MS / MPhil Programs – 2 years (4 semesters)
- PhD Programs – 3 years
- Gender Distribution
- Percentage of Males and Females studying Computer Science and Information Technology in Pakistan is:
- Males = 86%
- Females = 14%
- Pakistani Universities Rule for Drop Out
- Rules
- If a student gets GPA < 2.0 in 1st semester, (s)he will go on Probation
- In the 2nd semester, if student does not improve and his / her GPA < 2.0, then (s)he will be dropped out
- Conclusion
- To conclude, if we help a student to avoid probation in the 1st semester, then it is likely (s)he will not drop out at the end of 2nd semester
- Representative Sample
- To draw a Representative Sample, I would collect data with following characteristics
- Characteristics
- Universities
- Insha Allah, I will collect Sample Data from both
- Government Sector Universities and
- Private Sector Universities
- For example, if I collect Sample Data from 10 universities then there will be
- 8 (80%) Government Sector Universities and
- 2 (20%) Private Sector Universities
- Computer Science Degree Programs
- Insha Allah, I will collect Sample Data from the following disciplines
- BS(CS)
- BS(IT)
- BS(SE)
- Gender Distribution
- Insha Allah (انشاء اللہ), I will collect Sample Data, in which
- 86% Data will be of Male students
- 14% Data will be of Female students
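The sampling targets above reduce to simple percentage arithmetic, which is worth scripting so the same stratified split can be reused at prototype and full scale. A small sketch:

```python
def stratum_sizes(total, proportions):
    """Split a sample target across strata by percentage.

    Assumes the percentages sum to 100; rounds to whole instances.
    """
    return {name: round(total * pct / 100) for name, pct in proportions.items()}

# Target distributions from the sampling plan above.
universities = stratum_sizes(10, {"Government": 80, "Private": 20})
gender = stratum_sizes(20_000, {"Male": 86, "Female": 14})
```

For 10 universities this yields 8 Government and 2 Private; for 20,000 instances, 17,200 Male and 2,800 Female, matching the plan.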
- Issue 2 – Time Span
- I would collect Sample Data of university students from the last 10 years
- i.e., from 1-1-2011 to 31-12-2021
- Assumption
- The university student drop out pattern over the last 10 years is likely to be almost the same as today
- Issue 3 – Size of Corpus
- I target that my proposed corpus should contain
- 20,000 instances (or students' records)
- TIP
- To hit the target of 20,000, try to collect
- 30,000 instances (or students' records)
- Remember
- To hit the target, always aim a bit higher 😊 Good Luck
- Issue 4 – Potential Data Sources
- Potential Data Sources
- I aim to collect Sample Data from the following 10 universities in Pakistan
- Government Sector Universities
- The University of Punjab, Lahore
- University of Engineering and Technology (UET), Lahore
- Bahauddin Zakariya University (BZU), Multan
- COMSATS University, Lahore
- COMSATS University, Islamabad
- COMSATS University, Vehari
- COMSATS University, Sahiwal
- COMSATS University, Wah Cantt
- Private Sector Universities
- The University of Lahore, Lahore
- University of Management and Technology, Lahore
- Reasons for Selecting these Data Sources
- I have links in BZU, Multan and UET, Lahore because I did my
- BS(CS) from BZU, Multan and
- MS(CS) from UET, Lahore
- I am teaching as an Assistant Professor at COMSATS University, Lahore Campus; therefore, Insha Allah, I can easily collect Sample Data from
- COMSATS University, Lahore
- COMSATS University, Islamabad
- COMSATS University, Vehari
- COMSATS University, Sahiwal
- COMSATS University, Wah Cantt
- My friends are teaching in the following universities, In Sha Allah they will help me to collect Sample Data
- The University of Lahore, Lahore
- University of Management and Technology, Lahore
- The University of Punjab, Lahore
- All 10 universities are well-established and have been offering BS degree programs in Computer Science for the last 15 years
- All 10 universities will provide authentic data, which will be appropriate for developing a Gold Standard Annotated Corpus to Train / Test the GPA Prediction system
- All 10 universities have thousands of Computer Science students, so it will be easy to collect sufficiently large Sample Data
- Potential Challenges in Collecting Data from Data Sources
- Problem 1
- A student’s academic record is private data
- How to collect it?
- A Possible Solution
- Write an application to the Director / V.C. of the university explaining
- What is your proposed GPA Prediction system?
- How will you ensure Data Protection (e.g., by anonymizing students' academic records)?
- What potential benefits will the proposed GPA Prediction system have in reducing the number of university drop outs?
- Issue 5 – High-quality Data
- Insha Allah, I hope Sample Data collected from universities will be of high-quality because
- All 10 universities store, retrieve and manipulate their students' data using state-of-the-art Relational Database Management Systems
- Issue 6 – Balanced Data
- Insha Allah, I aim to collect data of 20K students, with the following distributions
- Gender Distribution
- 17,200 Male Students (86%)
- 2,800 Female Students (14%)
- Drop Out Vs Not Drop Out
- 10,000 Drop Out Students (50%)
- 10,000 Not Drop Out Students (50%)
- Issue 7 – Data Collection
- To develop Gold Standard Annotated Corpus, I need Sample Data from universities in
- Structured and
- Digital Format
- I will request universities to share data in
- CSV Format
- Reason
- It is easy to export data in CSV Format from a Relational Database
- Issue 8 – Annotation Guidelines
- In my targeted Data Sources, Output (GPA in 1st semester) is provided with the Input (Matric Marks and F.Sc. Marks)
- So, I don’t need to design Annotation Guidelines
- Issue 9 – Annotators
- In my targeted Data Sources, Output (GPA in 1st semester) is provided with the Input (Matric Marks and F.Sc. Marks)
- So, I don’t need Annotators
- Issue 10 - Data Protection
- Problem
- A student’s academic record is private data
- How to protect his / her identity?
- A Possible Solution
- I will request universities to provide only five attributes for each student, without disclosing his / her identity
- Matric Marks
- F.Sc. Marks
- GPA in 1st semester
- Gender
- Degree Program (BS(CS) / BS(IT) / BS(SE))
- Consequently, there will be no Data Privacy issues
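The five-attribute restriction can be enforced mechanically: if only those columns are ever written out, identifying fields never enter the corpus. A sketch using Python's csv module, with an in-memory file and a hypothetical "Name" column standing in for a raw university export:

```python
import csv
import io

# The five attributes to retain; everything else (names, roll numbers,
# etc.) never enters the corpus, so identities stay protected.
KEEP = ["Matric Marks", "F.Sc. Marks", "GPA in 1st semester",
        "Gender", "Degree Program"]

def anonymize(src, dst):
    """Copy only the five non-identifying columns from src to dst."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=KEEP)
    writer.writeheader()
    for row in reader:
        writer.writerow({k: row[k] for k in KEEP})

# Demo on an in-memory file; "Name" is a hypothetical identifying column.
raw = io.StringIO(
    "Name,Matric Marks,F.Sc. Marks,GPA in 1st semester,Gender,Degree Program\n"
    "Ali,900,880,3.4,Male,BS(CS)\n")
out = io.StringIO()
anonymize(raw, out)
```

The output file keeps the marks, GPA, gender and degree program but contains no name, so sharing it raises no Data Privacy issue.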
- Insha Allah, I will also make my proposed corpus
- Free and publicly available for research purposes 😊
- Remember
- To become a great personality
- Serve the Humanity for the Sake of Allah (مخلوق خدا کی بےلوث خدمت)
- Issue 11: Corpus Standardization
- Insha Allah, I will standardize my corpus in
- CSV Format
- Step 4: Developing Proposed Annotated Corpus at Prototype Level
- Developing Proposed Annotated Corpus at Prototype Level
- The biggest advantage of doing a task at prototype level is that
- Through manual inspection, you can verify whether your automatic approach (or process) is working correctly or not
- Remember
- If you cannot successfully execute a task at prototype level
- You can never execute it at Real-world level
- Note
- In Sha Allah, I will develop a small (at prototype level) Annotated Gold Standard Corpus of 15 instances
- I will also list down the problems (along with their Possible Solutions) that I encountered in developing Annotated Gold Standard Corpus at prototype level
- Steps – Annotated Corpus Creation Process
- When you create your proposed Annotated Corpus at prototype and / or full scale level, follow these main steps
- Step 1: Raw Data Collection
- Step 1.1: Data Cleaning (if needed)
- Step 1.2: Data Pre-processing (if needed)
- Step 2: Annotation Process
- Step 2.1: Annotation Guidelines
- Step 2.2: Annotations
- Step 2.3: Inter-Annotator Agreement (IAA)
- Step 3: Corpus Characteristics and Standardization
- Step 1: Raw Data Collection
- In Sha Allah, I will follow these steps to collect Raw Data
- Step 1: Write an application to Directors / V.Cs. of the 10 targeted universities and request them to share their data for my research project (GPA Prediction system)
- Step 2: Store data collected from all 10 universities in separate CSV files
- I will make separate CSV Data Files for each Data Source
- uop-data.csv
- uet-data.csv
- bzu-data.csv
- comsats-lahore-data.csv
- comsats-islamabad-data.csv
- comsats-vehari-data.csv
- comsats-sahiwal-data.csv
- comsats-wahcantt-data.csv
- uol-data.csv
- umt-data.csv
- Note
- You will need to use your personal links as well to
- Get data and
- Speed up data collection process
- You may not get data from all 10 universities
- Therefore, if you want to collect data from 10 universities
- At least target 50 universities
- Remember
- 80-20 Rule (a.k.a. Pareto Principle)
- 80% of outcomes (outputs) come from 20% of causes (inputs)
- Example
- You spend 80% of your time with 20% of the people you know
- Step 3: I will combine data from all 10 CSV Data Files with following data distributions (called combine-data.csv)
- Gender Distribution
- Total Instances = 20,000
- Male Instances = 17,200 (86%)
- Female Instances = 2,800 (14%)
- Drop Out Vs Not Drop Out
- Total Instances = 20,000
- Drop Out Instances = 10,000 (50%)
- Not Drop Out Instances = 10,000 (50%)
- Note
- If you get more than 20K instances, then keep them
- Recall
- Our minimum Sample Data target was 20K
- If we can get more than 20K instances, then it is very good news 😊
- Step 4: Pre-process Data (combine-data.csv) by
- Removing incomplete records and
- Only keeping three attributes
- Matric Marks
- F.Sc. Marks
- GPA in 1st semester
- We call this Data file pre-processed-data.csv
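Steps 3 and 4 (combining the per-university CSV files and pre-processing) can be sketched as follows. The in-memory files stand in for uet-data.csv, bzu-data.csv, etc., and the column names are assumptions about how the universities label their exports:

```python
import csv
import io

REQUIRED = ["Matric Marks", "F.Sc. Marks", "GPA in 1st semester"]

def combine_and_preprocess(sources):
    """Merge rows from several per-university CSV readers, drop
    incomplete records, and keep only the three modelling attributes."""
    kept = []
    for src in sources:
        for row in csv.DictReader(src):
            # An incomplete record is one with any required field empty.
            if all(row.get(col, "").strip() for col in REQUIRED):
                kept.append({col: row[col] for col in REQUIRED})
    return kept

# Two in-memory files stand in for the 10 per-university CSV files;
# the second row of uni_a is incomplete (missing F.Sc. Marks).
uni_a = io.StringIO("Matric Marks,F.Sc. Marks,GPA in 1st semester\n"
                    "900,880,3.4\n700,,2.4\n")
uni_b = io.StringIO("Matric Marks,F.Sc. Marks,GPA in 1st semester\n"
                    "600,580,2.0\n")
rows = combine_and_preprocess([uni_a, uni_b])
```

Writing `rows` back out with csv.DictWriter would produce the pre-processed-data.csv file described above.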
- Step 2: Annotation Process
- Note
- For the Annotation Process, I will use the cleaned and pre-processed data (pre-processed-data.csv)
- Step 3: Corpus Characteristics and Standardization
- I will summarize the main characteristics of my Gold Standard Annotated Corpus in a Table
- Main Characteristics of Gold Standard Annotated Corpus
- Total number of instances in corpus
- Number of Male instances in corpus
- Number of Female instances in corpus
- Number of students in BS(CS) degree program
- Number of students in BS(SE) degree program
- Number of students in BS(IT) degree program
- Total number of students dropped out
- Total number of students not dropped out
- Number of Male students dropped out
- Number of Female students dropped out
- Corpus Standardization
- I standardized my Gold Standard Annotated Corpus in
- CSV Format
- Important Note
- In Sha Allah, I will release Gold Standard Annotated Corpus in two versions
- First Version – Containing Five Attributes
- Matric Marks
- F.Sc. Marks
- GPA in 1st semester
- Gender
- Degree Program
- See gpa-prediction-all-attributes.csv file in Supporting Material
- Note
- This version can be used to analyze Gold Standard Annotated Corpus
- Second Version – Containing Three Attributes
- Matric Marks
- F.Sc. Marks
- GPA in 1st semester
- See gpa-prediction-data.csv file in Supporting Material
- Note
- This version can be used to immediately Train / Test Machine Learning Algorithms to develop GPA Prediction system
- Step 5: Developing Proposed Annotated Corpus at Full Scale Level
- In this step, In Sha Allah, I will develop the full scale corpus of 20K instances using
- The same steps I followed in Step 4: Developing Proposed Annotated Corpus at Prototype Level
TODO and Your Turn
Todo Tasks
TODO Task 3
- Task
- The Government of Saudi Arabia wants to predict how many people will come for Hajj in 2021, so that they can make arrangements. To accurately predict the Number of People Performing Hajj in 2021 (Output), the Government wants to develop a Hajj Prediction System. The Government of Saudi Arabia has hired you for the development of the Hajj Prediction System (Regression Problem).
- Dua
- Note
- Your answer should be
- Well Justified
- Questions
- Write the Input and Output for the above Real-world Problem.
- What type of Data Sources will be available for creating a corpus for the mentioned Real-world Problem?
- Sources with Annotations
- Sources without Annotations
- Apply the Main Steps for Data Annotation to create your Gold Standard Annotated Corpus for the task: Hajj Prediction in 2021
Your Turn Tasks
Your Turn Task 3
- Task
- Select a Real-world Problem and write it in one sentence (similar to: Hajj Prediction in 2021 (Regression Problem)). Your task is to create a Gold Standard Annotated Corpus for your Real-world Problem. However, you must select Data Sources with Annotations to collect Sample Data for creating the Gold Standard Annotated Corpus.
- Note
- Your answer should be
- Well Justified
- Questions
- Write Input and Output for your selected Real-world Problem.
- Apply the Main Steps for Data Annotation to create your Gold Standard Annotated Corpus for the selected Real-world Problem.
Developing Gold Standard Annotated Corpus using Data Sources without Annotations
Example 1 – Developing Gold Standard Annotated Corpus using Data Sources without Annotations – A Step by Step Example
- Main Steps for Data Annotation
- Step 1: Completely and correctly understand the Real-world Problem
- Step 2: Check if the Real-world Problem can be treated as a Machine Learning Problem?
- If Yes
- Go to Next Step
- Step 3: Write down Possible Solution(s) to the Annotated Corpus Development Issues discussed in Lecture 3 – Data and Annotations
- Note that Possible Solution(s) to each Annotated Corpus Development Issue should be well justified
- Step 4: Develop proposed Annotated Corpus at the Prototype Level
- Record the problems that you faced in developing Annotated Corpus at prototype level
- Write down Possible Solution(s) to handle the problems encountered in developing the Annotated Corpus at prototype level
- Step 5: Develop proposed Annotated Corpus at full scale
- Step 1: Completely and correctly understand the Real-world Problem
Understanding Real-world Problem
- Real-world Problem
- On Social Networking and Companies' Websites, billions of people post comments on a daily basis. However, it is not known what emotions people have about a particular event / product. Manual emotion analysis of billions of posts (texts) is practically impossible.
- A Possible Solution
- Develop an emotion prediction system, which accurately predicts the emotions in a Text
- Step 2: Can We Treat Real-world Problem as a Machine Learning Problem?
- Checking If Real-world Problem Can be Treated as a Machine Learning Problem?
- Recall Lecture 2
- Majority of Machine Learning involves
- Learning Input-Output Functions
- Conclusion
- If you can break a Real-world Problem into
- Input and Output then
- You can treat that Real-world Problem as a Machine Learning Problem 😊
- Goal
- Develop an automatic emotion prediction system
- Input
- A Text
- Output
- Emotion
- Possible Output Values
- Anger, Anticipation, Disgust, Fear, Joy, Love, Optimism, Pessimism, Sadness, Surprise, Trust, Neutral (or No Emotion)
- Conclusion
- Yes, we can treat Emotion Prediction Problem (Real-world Problem) as a Machine Learning Problem😊
- Note
- Emotion Prediction Problem is a (Multi-class) Classification Problem because
- Output = Categorical
- Number of Outputs = 12 Categories (Emotions)
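To make the 12-category Classification setting concrete, here is a deliberately naive keyword-lexicon classifier. Both the lexicon and the sample tweet are illustrative assumptions; a real system would learn the Text-to-Emotion mapping from the annotated corpus rather than from a hand-written word list:

```python
EMOTIONS = ["Anger", "Anticipation", "Disgust", "Fear", "Joy", "Love",
            "Optimism", "Pessimism", "Sadness", "Surprise", "Trust",
            "Neutral (or No Emotion)"]

# Toy keyword lexicon (an assumption for illustration only).
LEXICON = {"furious": "Anger", "hope": "Optimism", "happy": "Joy",
           "afraid": "Fear", "sad": "Sadness"}

def predict_emotion(tweet):
    """Return the first matching emotion category, else Neutral."""
    for word in tweet.lower().split():
        if word in LEXICON:
            return LEXICON[word]
    return "Neutral (or No Emotion)"

label = predict_emotion("Citizens hope the summit succeeds")
```

The point is only the shape of the function: Input is a Text, Output is exactly one of the 12 categories, which is what makes this a Multi-class Classification Problem.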
- Step 3: Possible Solutions to Annotated Corpus Development Issues
- Research Focus plays a key role in Annotated Corpus Development Process
- Research Focus and Annotated Corpus Development Issues
- Example – Research Focus
- Research Focus 1
- Develop an automatic Emotion Prediction system for all Social Networking Websites and Companies Websites
- Remarks / Comments
- It will be practically impossible to collect large and high-quality data from all Social Networking Websites and Companies Websites
- Also, the amount of time and effort required to collect data will be more than the amount of work required for an MS / PhD thesis
- Conclusion
- To conclude, the Research Focus is very broad, we need to narrow it down
- Research Focus 2
- Develop an automatic Emotion Prediction system for all Social Networking Websites
- Remarks / Comments
- It will be very difficult and challenging to collect large and high-quality data from all Social Networking Websites
- Also, the amount of time and effort required to collect data will be more than the amount of work required for an MS / PhD thesis
- Conclusion
- To conclude, the Research Focus is broad, we need to narrow it down
- Research Focus 3
- Develop an automatic Emotion Prediction system for International Political Tweets posted on Twitter
- Remarks / Comments
- It will be possible to collect large and high-quality data of International Political Tweets from Twitter
- Also, the amount of time and effort required to collect data will be commensurate with the amount of work required for an MS / PhD thesis
- Conclusion
- To conclude, the Research Focus is fine 😊
- Note
- Before submitting a FYP / MS / PhD thesis, answer the following question
- Question
- Is this work enough for FYP / MS / PhD thesis?
- Answer
(Prof. Robert Gaizauskas, University of Sheffield, UK)
- Tip
- When planning to develop an annotated corpus
- Step 1: Write down your Research Focus in a single sentence:
- Step 2: Search on the internet for potential Data Source(s) and discuss with your Supervisor (or seniors)
- Step 3:
- Remember
- To be successful in life, only take
- Calculated Risks 😊
- Research Focus for Our Current Example
- Research Focus
- Develop an automatic Emotion Prediction system for International Political Tweets posted on Twitter
- Issue 1 – Data Sampling
- Population
- All International Political Tweets posted on Twitter
- Main Characteristics of Twitter
- Languages
- Tweets are published on Twitter in almost all the widely spoken languages of the world
- Twitter Pages on International Politics
- Global Politics (@Globalpoliticss)
- International Politics & Society (@ips_journal)
- International Affairs (@IAJournal_CHPolitics)
- And many more
- Note
- You can rank Twitter Pages on International Politics based on
- Number of Tweets Posted
- Number of Followers
- Number of Following
- And other metrics
- Length of a Tweet
- Maximum length of a Tweet is 280 characters
- Average length of a Tweet is 33 characters
- Only 1% of Tweets have length of 280 characters
- Representative Sample
- To draw a Representative Sample, I would collect Sample Data with following characteristics
- Characteristics
- Language
- English
- Twitter Pages on International Politics
- Step 1: I will make a ranked list of top Twitter Pages on International Politics
- Step 2: I will extract Tweets on International Politics from
- Top 10 International Politics Tweet Pages
- Issue 2 – Time Span
- I would collect Tweets on International Politics published in the last one year
- i.e., from 1-1-2021 to 31-12-2021
- Assumption
- The language used in International Political Tweets in the last one year is almost the same as today
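The one-year Time Span is easy to enforce programmatically as tweets are collected. A small sketch, assuming dates are stored in the D-M-YYYY style used above:

```python
from datetime import date

# Collection window from the Time Span decision above.
START, END = date(2021, 1, 1), date(2021, 12, 31)

def in_time_span(tweet_date_str):
    """Keep a tweet only if its D-M-YYYY date falls inside the window."""
    d, m, y = (int(part) for part in tweet_date_str.split("-"))
    return START <= date(y, m, d) <= END

keep = in_time_span("15-6-2021")   # inside the window
drop = in_time_span("3-2-2019")    # outside the window
```

Applying this filter during extraction avoids collecting tweets that would later have to be discarded.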
- Issue 3 – Size of Corpus
- I target that my proposed corpus should contain
- 120,000 Tweets on International Politics
- TIP
- To hit the target of 120,000, try to collect
- 200,000 Tweets on International Politics
- Remember
- To hit the target, always aim a bit higher 😊 Good Luck
- Issue 4 – Potential Data Sources
- Potential Data Sources
- Top 10 Twitter Pages on International Politics
- Reasons for Selecting these Data Sources
- The above-mentioned Data Sources are
- Authentic
- Appropriate to create Emotion Prediction system on International Politics
- Free and publicly available for research purposes
- Contain a large number of English Tweets on International Politics in Digital Format
- Potential Challenges in Collecting Data from Data Sources
- Problem 1
- Manual Data Collection of 120K Tweets on International Politics from Twitter is not practically possible
- A Possible Solution
- Use Twitter API to automatically extract English Tweets on International Politics from Twitter
- Problem 2
- How will you get access to the Twitter website to automatically extract English Tweets on International Politics from Twitter?
- A Possible Solution
- I will write an email to Twitter from my official email id explaining
- What is my proposed Emotion Prediction system?
- How will I ensure that Tweets will only be used for research purposes (not for any commercial purposes)?
- What will be the potential benefits of the proposed Emotion Prediction system?
- Problem 3
- Automatically extracting Tweets will produce noise in data
- A Possible Solution
- I will write a program to automatically clean the Tweets
- Problem 4
- Twitter does not allow you to collect Tweets using their Twitter API
- A Possible Solution
- Use Crowdsourcing Approach
- i.e. use the Wisdom of the Crowd
- Crowdsourcing
- Definition
- The practice of obtaining input into a task / project by enlisting the services of a large number of people, either paid or unpaid, typically via the Internet is called Crowdsourcing
- Benefits
- Crowdsourcing can
- Reduce costs
- Speed up project timelines
- Tap into Crowd intelligence and creativity
- Engage citizens at all levels of corporate and government processes
- Steps to Successful Crowdsourcing
- Step 1: Design the Job and divide the labor
- Step 2: Write clear instructions
- Step 3: Choose a Web platform to serve as your Crowd Market
- Step 4: Release the Job and recruit the Crowd
- Step 5: Listen to the Crowd and manage the Job
- Step 6: Assemble the work of the Crowd and create the final product
- Example - Collecting Tweets on International Politics through Crowdsourcing
- Step 1: Design the Job and divide the labor
- I have made a list of Top 10 Tweet Pages on International Politics
- One worker will collect data from one Twitter Page on International Politics
- Step 2: Write clear instructions
- Instructions
- Only collect Tweets in English language
- Collect 20K Tweets
- For each Tweet, save its
- Date
- URL
- Tweet should be published between 1-1-2021 and 31-12-2021
- Collect 500 – 1000 Tweets for each Month of the Year
- Tweets should contain one of the 12 emotions given below
- Anger, Anticipation, Disgust, Fear, Joy, Love, Optimism, Pessimism, Sadness, Surprise, Trust, Neutral (or No Emotion)
- I will pay 1 for each Tweet
- Step 3: Choose a Web platform to serve as your Crowd Market
- I advertised my Job on Amazon Mechanical Turk
- Reason
- Amazon Mechanical Turk is a popular and widely used Crowdsourcing platform
- Step 4: Release the Job and recruit the Crowd
- I had initial discussion and reviewed the initial task done by each Worker
- I shortlisted 10 Workers and gave them the Job
- Step 5: Listen to the Crowd and manage the Job
- I monitored their work on regular basis
- Step 6: Assemble the work of the Crowd and create the final product
- I collected Tweets from 10 workers and combined them into one single file
- Issue 5 – High-quality Data
- I collected data from Twitter using Twitter API
- Problem 1
- Data extracted through the Twitter API will contain noise
- A Possible Solution
- Write a program to automatically clean data
- Problem 2
- Text in English Tweets may not be properly tokenized
- A Possible Solution
- Use an accurate Word Tokenizer to properly tokenize the Tweets
- Note
- To improve quality of my Tweets data, I properly
- cleaned and tokenized it
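The cleaning program mentioned above typically strips URLs, @mentions, and hashtag markers before tokenization. A minimal regex-based sketch; the exact rules are assumptions and would be tuned on real tweets:

```python
import re

def clean_tweet(text):
    """Strip URLs, user mentions, and the '#' of hashtags, then
    collapse whitespace; a minimal sketch of the cleaning step."""
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    text = text.replace("#", " ")               # keep hashtag words, drop '#'
    return " ".join(text.split())               # normalize whitespace

cleaned = clean_tweet("Breaking: @ips_journal on #sanctions https://t.co/xyz now")
```

The whitespace normalization at the end also leaves the text ready for a proper Word Tokenizer, addressing Problem 2.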
- Issue 6 – Balanced Data
- There are 12 Emotion Categories in total (see Issue 1 – Data Sampling)
- Insha Allah (انشاء اللہ), I aim to collect 10K Tweets for each Emotion Category
- Corpus Size = 12 x 10K
- Corpus Size = 120K
- Issue 7 – Data Collection
- The Sample Data that I need to develop my proposed Gold Standard Annotated Corpus is in
- Digital Format
- I will use Twitter API to automatically extract (or collect) Sample Data from Twitter (online Data Source)
- Issue 8 – Annotation Guidelines
- In my targeted Data Source, Output (Emotion) is not provided with the Input (Tweet)
- So, I need to design Annotation Guidelines
- Issue 9 – Annotators
- In my targeted Data Source, Output (Emotion) is not provided with the Input (Tweet)
- So, I need at least three Annotators
- Issue 10 - Data Protection
- Since I am using free and publicly available Tweets (Data) for research purposes
- Therefore, there are no Data Privacy issues
- Insha Allah (انشاء اللہ), I will also make my proposed corpus
- Free and publicly available for research purposes 😊
- Remember
- To become a great personality
- Serve the Humanity for the Sake of Allah (مخلوق خدا کی بےلوث خدمت)
- Issue 11: Corpus Standardization
- In Sha Allah, I will standardize my corpus in
- CSV Format
- XML Format
- Step 4: Developing Proposed Annotated Corpus at Prototype Level
- Developing Proposed Annotated Corpus at Prototype Level
- The biggest advantage of doing a task at prototype level is that
- Through manual inspection, you can verify whether your automatic approach (or process) is working correctly
- Remember
- If you cannot successfully execute a task at prototype level
- You can never execute it at Real-world level
- Note
- In Sha Allah, I will develop a small (at prototype level) Annotated Gold Standard Corpus of 15 instances
- I will also list down the problems (along with their Possible Solutions) that I encountered in developing Gold Standard Annotated Corpus at prototype level
- Steps – Annotated Corpus Creation Process
- When you create your proposed Annotated Corpus at prototype and / or full scale level, follow these main steps
- Step 1: Raw Data Collection
- Step 1.1: Data Cleaning (if needed)
- Step 1.2: Data Pre-processing (if needed)
- Step 2: Annotation Process
- Step 2.1: Annotation Guidelines
- Step 2.2: Annotations
- Step 2.3: Inter-Annotator Agreement (IAA)
- Step 3: Corpus Characteristics and Standardization
- Step 1: Raw Data Collection
- In Sha Allah, I will follow these steps to collect Raw Data
- Step 1: Write a program in Python programming language (called extract-tweets.py) to extract Tweets on International Politics from Twitter using Twitter API
- Step 2: Use extract-tweets.py to extract the following information for each Tweet and store it in raw-data.csv
- Text of Tweet
- Date of Tweet
- URL of Tweet
- Step 3: Write a program in Python programming language (called remove-duplicates.py) and use that program to automatically remove duplicate Tweets from raw-data.csv (new file is called data.csv)
- Step 4: Write a program in Python programming language (clean-tweets.py) and use that program to clean Tweets text in data.csv by removing the following: (new file is called cleaned-data.csv)
- HTML tags
- Hyper links
- Foreign language characters etc.
- Step 5: Write a program in Python programming language (tokenize-tweets.py) and use that program to properly tokenize text in Tweets (new file is called cleaned-processed-data.csv)
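Assuming each Tweet is held as a row with text / date / url fields, the de-duplication, cleaning, and tokenization steps (remove-duplicates.py, clean-tweets.py, tokenize-tweets.py) might look roughly like this combined sketch. File I/O is omitted and all names and sample rows are illustrative:

```python
import re

def remove_duplicates(rows):
    """Keep only the first occurrence of each Tweet text."""
    seen, unique = set(), []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            unique.append(row)
    return unique

def clean(text):
    """Strip HTML tags and hyperlinks, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Separate punctuation from words; return space-joined tokens."""
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

rows = [
    {"text": "Vote <b>now</b>! https://t.co/a", "date": "2021-01-01", "url": "u1"},
    {"text": "Vote <b>now</b>! https://t.co/a", "date": "2021-01-02", "url": "u2"},
    {"text": "Summit talks continue...", "date": "2021-01-03", "url": "u3"},
]
rows = remove_duplicates(rows)          # duplicate Tweet dropped
for row in rows:
    row["text"] = tokenize(clean(row["text"]))
```

In practice each step would read its input CSV and write a new one (data.csv, cleaned-data.csv, cleaned-processed-data.csv), exactly as the steps above describe.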
- Step 2: Annotation Process
- Annotation Process will comprise the following three steps
- Annotation Guideline
- Annotations
- Inter-Annotator Agreement (IAA)
- Note
- For Annotation Process, I will use only 5 Tweets from the cleaned and pre-processed (properly tokenized) data stored in file called cleaned-processed-data.csv
- Annotation Guidelines
- Following are the instructions for Annotators to annotate Tweets on International Politics
- Annotators
- Standard Practice
- Use Three Annotators (A, B and C) for Data Annotation
- Note
- You can use more than three Annotators for Data Annotation
- Characteristics of Annotators
- Annotators must be Domain Experts
- Annotators should be Expert and / or Native Speaker in the language in which text is written
- Annotators should be well qualified
- Standard Practice
- Annotation Steps
- Generally, annotations are carried out in the following four steps
- Step 1: Annotators A and B annotate a subset of the data
- Discuss Conflicting and Agreed annotations and refine the Annotation Guidelines to reduce the conflicts
- Step 2: Annotators A and B annotate the remaining data based on revised Annotation Guidelines
- Step 3: Annotator C annotates conflicting data
- Step 4: Compute Inter-Annotator Agreement
- Inter-Annotator Agreement
- Definition
- Inter-Annotator Agreement (IAA) is a measure of how well two (or more) Annotators can make the same annotation decision for a certain category
- Purpose
- IAA is computed to determine two things
- How easy was it to clearly define the category?
- How trustworthy is the annotation?
- Kappa Coefficient
- Definition
- Kappa Coefficient measures the agreement between two Annotators, while taking into account the possibility of chance agreement
- Kappa (κ) = (N × d − µ) / (N² − µ)
- N = Total number of Instances
- d = Sum of Agreed Instances (instances to which Annotators A and B assigned the same Class / Category)
- µ = Sum of the Products of the Instance counts of Annotators A and B for each Class / Category
- Standard Range of Kappa Coefficient
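Using the quantities defined above — N (total instances), d (agreed instances), and µ (sum over categories of the product of each Annotator's counts) — Kappa can be computed as κ = (N·d − µ) / (N² − µ). A small sketch, assuming two equal-length label lists:

```python
def kappa(labels_a, labels_b):
    """Cohen's Kappa from the slide's quantities:
    N = total instances, d = agreed instances,
    mu = sum over categories of (count by A) * (count by B)."""
    assert len(labels_a) == len(labels_b)
    N = len(labels_a)
    d = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    categories = set(labels_a) | set(labels_b)
    mu = sum(labels_a.count(c) * labels_b.count(c) for c in categories)
    return (N * d - mu) / (N * N - mu)

# Toy annotations: A and B agree on 4 of 5 Tweets
a = ["Joy", "Joy", "Anger", "Fear", "Joy"]
b = ["Joy", "Anger", "Anger", "Fear", "Joy"]
k = kappa(a, b)  # 0.6875
```

This is algebraically the same as the textbook form κ = (P(A) − P(E)) / (1 − P(E)) with P(A) = d/N and P(E) = µ/N².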
- Annotating Tweets on International Politics
- Separately give Tweets to Annotators A & B
Annotating Tweets on International Politics, Cont…
- Annotations by Annotator A
- Annotating Tweets on International Politics, Cont…
- Annotations by Annotator B
- Annotating Tweets on International Politics, Cont…
- Combine annotations of Annotators A and B to compute Inter-Annotator Agreement (IAA) and Kappa Coefficient
- Conflict Resolution
- Give only the conflicting Tweets to Annotator C
- Final Gold Standard Annotated Corpus
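The conflict-resolution step above can be sketched as follows, assuming Annotator C's labels are supplied only for the conflicting instances (all names and labels are hypothetical):

```python
def gold_labels(ann_a, ann_b, ann_c):
    """Keep the label where A and B agree; otherwise take C's label.
    ann_c maps the index of each conflicting instance to C's label."""
    final = []
    for i, (a, b) in enumerate(zip(ann_a, ann_b)):
        final.append(a if a == b else ann_c[i])
    return final

a = ["Joy", "Anger", "Fear"]
b = ["Joy", "Sadness", "Fear"]
c = {1: "Anger"}                 # C only sees the conflicting Tweet
gold = gold_labels(a, b, c)      # ["Joy", "Anger", "Fear"]
```

The resulting list is the label column of the final Gold Standard Annotated Corpus.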
- Step 3: Corpus Characteristics and Standardization
- I will summarize the main characteristics of my Gold Standard Annotated Corpus in a Table
- Main Characteristics of Gold Standard Annotated Corpus
- Total number of Tweets in corpus
- Total number of words in corpus
- Total number of unique words in corpus
- Average length of a Tweet in corpus
- Total number of Tweets with Anger Emotion
- Total number of Tweets with Anticipation Emotion
- Total number of Tweets with Disgust Emotion
- Total number of Tweets with Fear Emotion
- Total number of Tweets with Joy Emotion
- Total number of Tweets with Love Emotion
- Total number of Tweets with Optimism Emotion
- Total number of Tweets with Pessimism Emotion
- Total number of Tweets with Sadness Emotion
- Total number of Tweets with Surprise Emotion
- Total number of Tweets with Trust Emotion
- Total number of Tweets with No Emotion (Neutral)
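The characteristics listed above can be computed with a few lines of Python. A sketch on a toy in-memory corpus — real code would read the Tweets and labels from cleaned-processed-data.csv:

```python
from collections import Counter

# Toy corpus: (tweet text, emotion label) pairs
tweets = [
    ("pm speech was inspiring today", "Joy"),
    ("outraged by the new sanctions", "Anger"),
    ("worried about the border crisis", "Fear"),
    ("outraged by the election results", "Anger"),
]

total_tweets = len(tweets)
words = [w for text, _ in tweets for w in text.split()]
total_words = len(words)
unique_words = len(set(words))
avg_length = total_words / total_tweets          # average Tweet length in words
per_emotion = Counter(label for _, label in tweets)
```

Each quantity corresponds to one row of the characteristics table (total Tweets, total words, unique words, average Tweet length, and per-Emotion counts).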
- Corpus Standardization
- I standardized my Gold Standard Annotated Corpus in
- CSV Format
- XML Format
Gold Standard Annotated Corpus in CSV Format
Gold Standard Annotated Corpus in XML Format
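Writing the annotated corpus out in both CSV and XML might look like this, using Python's standard csv and xml.etree.ElementTree modules; the field names and sample rows are illustrative, not the corpus's actual schema:

```python
import csv
import io
import xml.etree.ElementTree as ET

corpus = [
    {"id": "1", "text": "outraged by the new sanctions", "emotion": "Anger"},
    {"id": "2", "text": "pm speech was inspiring", "emotion": "Joy"},
]

# CSV: one row per annotated Tweet (written to a string here; use a file in practice)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "text", "emotion"])
writer.writeheader()
writer.writerows(corpus)
csv_text = buf.getvalue()

# XML: one <tweet> element per annotated Tweet
root = ET.Element("corpus")
for row in corpus:
    tweet = ET.SubElement(root, "tweet", id=row["id"])
    ET.SubElement(tweet, "text").text = row["text"]
    ET.SubElement(tweet, "emotion").text = row["emotion"]
xml_text = ET.tostring(root, encoding="unicode")
```

Both formats carry the same content; CSV is convenient for spreadsheet tools, while XML makes the annotation structure explicit.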
- Step 5: Developing Proposed Annotated Corpus at Full Scale Level
- In this step, I will develop the full-scale corpus of 120K instances using
- The same steps I followed in Step 4: Developing Proposed Annotated Corpus at Prototype Level
TODO and Your Turn
Todo Tasks
Your Turn Tasks
Todo Tasks
TODO Task 4
- Task
- Ghufran wants to develop a Gold Standard Annotated Corpus for the following Real-world Problem
- Real-world Problem
- Detection of Offensive Language against Islam on Twitter
- Ghufran wants to treat this Real-world Problem as a Binary Classification Problem
- Note
- Your answer should be
- Well Justified
- Questions
- What are the Input and Output for the above Real-world Problem?
- What type of Data Sources will be available for creating corpus for mentioned Real-world Problem?
- Sources with Annotations
- Sources without Annotations
- Apply the Main Steps for Data Annotation to create your Gold Standard Annotated Corpus for the task: Detection of Offensive Language against Islam on Twitter
Your Turn Tasks
Your Turn Task 4
- Task 1
- Select a Real-world Problem and write it in one sentence (similar to: Detection of Offensive Language against Islam on Twitter). Your task is to create a Gold Standard Annotated Corpus for your Real-world Problem. However, you must select Data Sources without Annotations to collect Sample Data for creating Gold Standard Annotated Corpus.
- Note
- Your answer should be
- Well Justified
- Questions
- What are the Input and Output for your selected Real-world Problem?
- Apply the Main Steps for Data Annotation to create your Gold Standard Annotated Corpus for the selected Real-world Problem.
Chapter Summary
- Chapter Summary
- The five main Steps for Data Annotation are as follows
- Step 1: Completely and correctly understand the Real-world Problem
- Step 2: Check whether the Real-world Problem can be treated as a Machine Learning Problem
- If Yes
- Go to Next Step
- Step 3: Write down Possible Solution(s) to the Annotated Corpus Development Issues discussed in Lecture 3 – Data and Annotations
- Note that Possible Solution(s) to each Annotated Corpus Development Issue should be well justified
- Step 4: Develop proposed Annotated Corpus at the Prototype Level
- Record the problems that you faced in developing Annotated Corpus at prototype level
- Write down Possible Solution(s) to handle the problems encountered in developing the Annotated Corpus at prototype level
- Step 5: Develop proposed Annotated Corpus at full scale
- When you create your proposed Annotated Corpus at prototype and / or full scale level, follow these main steps
- Step 1: Raw Data Collection
- Step 1.1: Data Cleaning (if needed)
- Step 1.2: Data Pre-processing (if needed)
- Step 2: Annotation Process
- Step 2.1: Annotation Guidelines
- Step 2.2: Annotations
- Step 2.3: Inter-Annotator Agreement (IAA)
- Step 3: Corpus Characteristics and Standardization
- An authentic and appropriate Data Source is essential to develop a large Gold Standard Annotated Corpus
- As discussed in Lecture 3 – Data and Annotations, the main types of Data Sources are
- Sources with Annotations
- Sources without Annotations
- Sources with / without Annotations can be
- Online Digital Repositories
- Non-digital Repositories
- Existing Corpora
- Supervised Machine Learning Problems can be broadly categorized into three main types
- Classification Problems
- Regression Problems
- Sequence to Sequence Problems
- For each type of Machine Learning Problem
- Suitable Machine Learning Algorithms may differ
- Classification Problems – Input and Output
- Input
- Structured / Unstructured / Semi-structured
- Output
- Categorical
- Regression Problems – Input and Output
- Input
- Structured / Unstructured / Semi-structured
- Output
- Numeric
- Sequence to Sequence Problems – Input and Output
- Input
- Unstructured (of variable length)
- Output
- Unstructured (of variable length)
- In this Lecture, we have discussed four Step by Step examples of creating a Gold Standard Annotated Corpus
- Data Sources with Annotations
- Developing a Gold Standard Annotated Corpus for Urdu Text Summarization Task
- Developing a Gold Standard Annotated Corpus for GPA Prediction Task
- Data Sources without Annotations
- Developing a Gold Standard Annotated Corpus for Emotion Prediction on Tweets
- Developing a Gold Standard Annotated Corpus for Urdu-English Machine Translation Task
In Next Chapter
- In Next Chapter
- In Sha Allah, in the next Chapter, I will present a detailed discussion on
- Treating a Problem as a Machine Learning Problem - Step by Step Examples