Chapter 4 - Data and Annotation
Chapter Outline
- Chapter Outline
- Quick Recap
- Identifying Most Suitable Solution to a Real-world Problem
- Corpus / Dataset
- Data and Annotations
- Chapter Summary
Quick Recap
- Quick Recap – Basics of Machine Learning
- Before learning / doing a Task, it is essential to know its ultimate goal
- The ultimate goal of Machine Learning is
- Develop such a Machine which behaves like Human
- To stay motivated and successful in this life and hereafter
- Be Realistic
- To be honest, it is not possible to develop such a Machine which behaves like Human because
- Human (creature) can never be perfect like Allah (Creator)
- Machine Learning is mainly based on Inductive Learning Approach, which has Scope of Error
- Therefore, all Machine Learning Models will have Scope of Error and Machine cannot be intelligent like Human
- A Machine is said to learn if its today (character) is better than its yesterday (character)
- Generally, we use Machine Learning
- When we have Data with three main characteristics
- Large amount of Data
- High quality Data
- Balanced Data
- When patterns exist in our Data
- Even if we don’t know what they are
- In Traditional Programming, a Program Processes the Data according to a Set of Instructions to produce Output
- To solve a Real-world Problem through Programming, you need to know four things
- Purpose
- Input
- Data
- Instructions
- Processing
- Output
- The diagram below summarizes how Traditional Programming works
- The main job of a Machine Learning Algorithm is to
- Learn from Input to predict Output
- Input (Machine Learning Algorithm)
- Data (Input, Output pairs)
- Output (Machine Learning Algorithm)
- An intelligent Program (a.k.a. Model)
- The diagram below summarizes how Machine Learning works
- Machine Learning can be summarized in the following Equation
- Data = Model + Error
- Most of Machine Learning revolves around
- Learning Input-Output Functions
- a.k.a. Function Learning or Concept Learning
- Learner (Machine Learning Algorithm) – Input and Output
- Input to a Learner
- Set of Training Examples (D)
- Set of Functions / Hypothesis (H) (a.k.a. Hypothesis Space)
- Output of a Learner
- A hypothesis h from H, which is an approximation of the Target Function f
- The diagram below summarizes the general setting of Learning Input-Output Functions
- In Machine Learning, a Target Function f cannot be completely learned because
- Machine Learning uses Inductive Learning Approach, which has Scope of Error
- Therefore, Target Function (f) cannot be completely learned, however it can be approximated
- Hypothesis (h) is an approximation of the Target Function f
- A Machine is dumb and cannot understand Real-world Objects directly
- Therefore, we need to represent Real-world Objects in a Format which Machine Learning Algorithms can understand
- Very often, Example (a.k.a. instance, data point or observation) is represented as
- Attribute-Value Pair
- Input is mostly Vector-valued
- Output is mostly Single-valued
- Values of Attributes can be
- Categorical / Ordinal – e.g. Male, Female, Yes, No
- Numeric
- Discrete – e.g. 10, 25, 10000
- Continuous – e.g. 3.5, 5.9
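As a concrete illustration, an instance can be stored as attribute-value pairs. The attribute names and values below are hypothetical, chosen only to show the categorical, discrete and continuous cases:

```python
# Hypothetical instance represented as attribute-value pairs.
example = {
    "gender": "Male",     # Categorical value
    "age": 25,            # Numeric (Discrete) value
    "height_m": 1.75,     # Numeric (Continuous) value
}

# Input is vector-valued: the attribute values in a fixed order.
x = [example["gender"], example["age"], example["height_m"]]

# Output is single-valued, e.g. a class label.
y = "Adult"

print(x, "->", y)
```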
- Representation of Hypothesis (h) varies from ML Algorithm to ML Algorithm
- The main steps to Learn an Input-Output Function are as follows
- Step 1: Define the Concept to be learned
- Step 2: Take the examples of the Concept to be learned
- Step 3: Learn from Examples
- Step 4: Generalize the Concept learned from specific examples
- Learning Input-Output Functions can be summarized as
- Learn from Input to predict Output
- To build efficient Machine Learning Models, remember the following Equation
- Machine Learning = Data Understanding and Preprocessing (40%-50%) + Predictive Analysis (50%-60%)
- To build efficient Machine Learning Models follow the following steps
- Step 1: Build strong and accurate understanding of Data
- Step 2: Properly pre-process Data, so that it becomes high-quality data
- Step 3: Represent / Transform Data into a format which Machine Learning Algorithms can understand
- Step 4: Identify what Machine Learning Algorithms will be most suitable for your Data (prepared in Step 3)
- Step 5: Train / Test selected (in Step 4) Machine Learning Algorithms on Data (prepared in Step 3)
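The five steps above can be sketched end-to-end. The sketch below is a minimal illustration using only the Python standard library, with hypothetical toy data and a majority-class baseline standing in for a real Machine Learning Algorithm:

```python
from collections import Counter

# Toy annotated data (Input, Output) -- purely hypothetical examples.
raw = [("GOOD movie", "pos"), ("good acting", "pos"),
       ("bad plot", "neg"), ("good fun", "pos"), ("Bad ending", "neg")]

# Step 1: build an understanding of the Data (inspect class distribution).
class_counts = Counter(label for _, label in raw)

# Step 2: pre-process the Data (here, simply lowercase the text).
clean = [(text.lower(), label) for text, label in raw]

# Step 3: represent the Data in a Machine-understandable format (bag of words).
data = [({word: 1 for word in text.split()}, label) for text, label in clean]

# Step 4: select an algorithm -- here a majority-class baseline as a stand-in.
# Step 5: train on Train Data and test on Test Data.
train, test = data[:4], data[4:]
majority = Counter(label for _, label in train).most_common(1)[0][0]
predictions = [majority for _ in test]
print(class_counts, predictions)
```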
- It is very difficult to build an efficient Model unless and until you have a good understanding of your Data
- To have a strong and accurate understanding of your Data
- First perform manual inspection and then carry out automatic analysis
- Feature-based ML Algorithms are based on Manual Feature Engineering and Deep Learning ML Algorithms are based on Automatic Feature Engineering
- To select most suitable ML Algorithms for a Machine Learning Problem
- You must have high level of expertise in Machine Learning or consult a Machine Learning Expert
- No one has a definite answer about which Machine Learning Algorithms are most suitable for a given Machine Learning Problem, but we can start with those Algorithms which have proven effective in solving similar Machine Learning Problems
- Data is defined as Raw Facts and Figures
- Information is defined as the processed form of Data
- Data can be mainly categorized in three ways
- Form of Data – First Categorization of Data
- Text
- Image
- Video
- Audio
- Type of Data – Second Categorization of Data
- Structured
- Unstructured
- Semi-structured
- Length of Data – Third Categorization of Data
- Fixed Length
- Variable Length
- Data Annotation (a.k.a. data labeling / data tagging) is the process of labeling Data
- Data is annotated so that Machine Learning Algorithms can more accurately learn from annotated data
- Same Data can be annotated for different Tasks
- For Machine Learning Algorithms, Data is mainly available as
- Un-annotated Data
- Output is not associated with Input
- Annotated Data
- Output is associated with Input for all instances
- Semi-annotated Data
- Output is associated with Input for some instances
- For more accurate learning, it is important to have
- Large amount of Data
- High-quality Data
- Balanced Data
- The amount of Data required to accurately learn a Task
- Varies from Task to Task
- Data is said to be of High-quality if it contains instances which are
- Noise Free
- Complete and Correct
- Diversified
- Normally, Data Pre-processing tools are used to improve the quality of Data
- Balanced Data means that for each Class, the dataset must contain the same number of instances
- It is a difficult and challenging task to create large balanced datasets with high-quality
- Therefore, use large and high-quality datasets, which are moderately balanced
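Class balance is easy to check before training. A sketch with hypothetical labels:

```python
from collections import Counter

# Hypothetical class labels of a dataset.
labels = ["spam"] * 700 + ["not_spam"] * 300

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.items():
    print(f"{cls}: {n} ({100 * n / total:.0f}%)")

# A perfectly balanced dataset would have ~50% per class here;
# this one (70% / 30%) is only moderately balanced.
```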
- A Machine Learning Problem can be mainly categorized as
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- In Supervised Learning, a Machine Learning Algorithm learns from Annotated Data
- In Unsupervised Learning, a Machine Learning Algorithm learns from Unannotated Data
- In Semi-supervised Learning, a Machine Learning Algorithm learns from Semi-annotated Data
- Supervised Learning is broadly categorized into
- Classification
- Output is Categorical
- Regression
- Output is Numeric
- For each type of learning, different Machine Learning Algorithms are developed
- Good Starting Points – ML Algorithms for Supervised Learning
- Naïve Bayes
- Random Forest
- Support Vector Machine
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
- Good Starting Points – ML Algorithms for Classification
- Naïve Bayes
- Random Forest Classifier
- Support Vector Machine Classifier
- Logistic Regression
- Gradient Boosting
- Multi-layer Perceptron
- K-Nearest Neighbors
- Good Starting Points – ML Algorithms for Regression
- Random Forest Regressor
- Support Vector Machine Regressor
- Linear Regression
- Good Starting Points – ML Algorithms for Unsupervised Learning
- K-means Clustering Algorithm
- Agglomerative Hierarchical Clustering Algorithm
- Mean-Shift Clustering Algorithm
- DBSCAN – Density-Based Spatial Clustering of Applications with Noise
- EM using GMM – Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
- Good Starting Points – ML Algorithms for Semi-supervised Learning
- Label Propagation
- Four main Phases of Machine Learning Cycle are
- Training / Learning Phase
- Testing / Evaluation Phase
- Application Phase
- Feedback Phase
- In Machine Learning, we split Data into
- Train Data (or Train set)
- Test Data (or Test set)
- Generally, we split Data in a Train-Test Split Ratio of 67%-33%
- Training Data is used in the Training Phase and Test Data is used in the Testing Phase
- Annotation of Training Data varies for three types of learning
- Test Data must be annotated for all three types of learning
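The 67%-33% Train-Test Split can be sketched as follows, using hypothetical data and only the standard library (in practice a library helper such as scikit-learn's train_test_split is commonly used):

```python
import random

# Hypothetical dataset of 100 (Input, Output) instances.
data = [(i, i % 2) for i in range(100)]

random.seed(42)                 # fixed seed so the split is reproducible
random.shuffle(data)

split = int(len(data) * 0.67)   # 67%-33% Train-Test Split Ratio
train_set, test_set = data[:split], data[split:]

print(len(train_set), len(test_set))
```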
- In Training Phase, we
- Use Training Data to build a Model
- In Testing Phase, we
- Evaluate the performance of Model on unseen Data (Test Data) using standard Evaluation Measure(s)
- Classification – Standard Evaluation Measures
- Baseline Accuracy
- Accuracy
- Precision
- Recall
- F-measure
- Area Under the Curve (AUC)
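Most of these measures can be computed by hand for a binary Classification task. A minimal sketch with hypothetical predictions (AUC is omitted, since it needs predicted scores rather than labels):

```python
# Hypothetical true labels and Model predictions (positive class = 1).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

# Baseline Accuracy: always predict the majority class.
baseline = max(y_true.count(0), y_true.count(1)) / len(y_true)

print(accuracy, precision, recall, f_measure, baseline)
```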
- Regression – Standard Evaluation Measures
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² or Coefficient of Determination
- Adjusted R²
- Both Training and Testing Phases can be summarized as follows
- Recall the Equation
- Data = Model + Error
- In Training Phase, Model is built using the Training Data
- In Testing Phase, Test Data is used to check the Error in the Model
- If a Model performs well on large Test Data then it is
- Deployed in Real world to make predictions on future unseen instances (Application Phase)
- In Feedback Phase, we take Feedback from Users and Domain Experts on the deployed Model and try to further improve it based on the Feedback received
- Training Regime is a systematic way in which Training Data is used by a Machine Learning Algorithm to learn from it
- Three main types of Training Regimes are
- Batch Method
- In this method, all training examples are available and used all at once to build the Model (or hypothesis h)
- Incremental Method
- In this method, one member (training example) of the Training Data is selected at a time and used to modify the current hypothesis (h)
- On-line Method
- If training instances become available one at a time and are used as they become available, the method is called an on-line method
- Whatever Training Regime you use, the Learner (Machine Learning Algorithm) will return you a hypothesis (h), which is an approximation of the target function
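The regimes can be contrasted in code. The sketch below uses a toy perceptron, an assumption made only for illustration (it is not an algorithm from this chapter): the incremental / on-line regime modifies the current hypothesis one example at a time, whereas the batch regime would use all examples at once.

```python
# Toy linearly separable data: ([x1, x2], label), labels in {+1, -1}.
examples = [([1.0, 1.0], 1), ([2.0, 2.0], 1),
            ([-1.0, -1.0], -1), ([-2.0, -1.5], -1)]

def predict(w, x):
    """Current hypothesis h: sign of the weighted sum."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Incremental / on-line regime: examples are used one at a time,
# each mistake modifying the current hypothesis (the weights w).
w = [0.0, 0.0]
for epoch in range(5):
    for x, y in examples:            # in the on-line regime these would
        if predict(w, x) != y:       # arrive as a stream, one at a time
            w = [wi + y * xi for wi, xi in zip(w, x)]

# In the Batch regime, all training examples would instead be available
# and used all at once to build the Model.
print(w, [predict(w, x) for x, _ in examples])
```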
Identifying Most Suitable Solution to a Real-world Problem
- Real-world Problem
- Definition
- A Real-world Problem is defined as a matter or situation regarded as unwelcome or harmful and needing to be dealt with and overcome
- Purpose – Why Solve a Real-world Problem?
- To improve the quality of human life
- Example
- Real-world Problem
- You are walking in rain and getting wet 😊
- A Possible Solution
- Use an Umbrella 😊
- Real-world Problem
- Important Note
- There can be multiple solutions to a Real-world Problem
- Crucial thing is to identify the most suitable Solution(s)
- Steps - How to Identify Most Suitable Solution to a Real-world Problem?
- Step 1: Write down the Real-world Problem and your current circumstances
- Step 2: Completely and correctly understand the situation
- Step 3: List down the Possible Solutions that you know
- Step 4: Consult Domain Experts and people who faced similar Real-world Problem in the past and update your List of Possible Solutions
- Step 5: Write down strengths and weaknesses of each Possible Solution
- Step 6: Shortlist and rank 3 Possible Solutions that seem to be the most suitable
- Step 7: Consult a Domain Expert and select the one which seems to be most suitable to solve the Real-world Problem in your current situation
- Step 8: Apply the selected Solution to solve your Real-world Problem
- Example - Steps (How to Identify Most Suitable Solution of a Real-world Problem?)
- In the next slides, In Sha Allah I will try to explain these steps with an example
- For the example, I am considering Computer Science Faculty Members at COMSATS University, Lahore Campus
- Example – Step 1: Real-world Problem and Current Circumstances
- Real-world Problem
- Find a good and suitable Supervisor for MPhil thesis
- Current Circumstance
- I am studying in 2nd semester at COMSATS University, Lahore Campus
- I have taken all Core Courses and three Elective Courses
- Elective Courses
- Advanced Topics in Machine Learning
- Natural Language Processing
- Digital Image Processing
- My Undergrad Project (Final Year Project (FYP)) focused on
- Automatic Signature Recognition using Artificial Intelligence (AI) Techniques
- Future Research Work Plans
- I want to continue research in the field of Artificial Intelligence (AI)
- Step 2: Completely and Correctly Understanding the Situation
- This is a very important decision of my life
- I will need reference of my Supervisor for job, further studies etc.
- It is a lifetime relationship and if anything goes wrong, consequences will be devastating
- Step 3: Make a List of Potential Supervisors
- During my MPhil studies, I know the following Teachers working in the domain of Artificial Intelligence (AI)
- Teachers – Who Taught Me
- Usama Ijaz Bajwa
- Course Taught – Digital Image Processing
- Waqas Anwar
- Course Taught – Natural Language Processing
- Rao Muhammad Adeel Nawab
- Course Taught – Advanced Topics in Machine Learning
- I visited COMSATS website and found some more people working in AI
- Zulfiqar Habib
- Wajahat Mahmood Qazi
- My List of Potential Supervisors in AI
- Usama Ijaz Bajwa
- Waqas Anwar
- Rao Muhammad Adeel Nawab
- Zulfiqar Habib
- Wajahat Mahmood Qazi
- Step 4: Consultation
- I discussed my Real-world Problem with Domain Experts
- One of my Teachers
- Seniors who faced similar Real-world Problem in the past (Domain Expert)
- After discussion with Domain Experts, I updated my List of Potential Supervisors
- My Updated List of Potential Supervisors in AI
- Usama Ijaz Bajwa
- Waqas Anwar
- Rao Muhammad Adeel Nawab
- Zulfiqar Habib
- Wajahat Mahmood Qazi
- Atifa Ather
- Jawad Shafi
- Muhammad Salman Khan
- Aksam Iftikhar
- Step 5: Strengths and Weaknesses of Potential Supervisors
- After discussion with seniors and visiting webpages, I wrote down the strengths and weaknesses of each Potential Supervisor
- Note – Here I am writing the strengths and weaknesses of only one Potential Supervisor
- Rao Muhammad Adeel Nawab
- Strengths
- PhD Students
- Passed out = 2
- Under supervision = 4
- MPhil Students
- Passed out = 40+
- Under supervision = 6
- Publications
- Impact Factor Journal = 16
- Conferences / Workshops = 21
- Weaknesses
- Adeel doesn’t compromise on quality
- He is very strict in meeting deadlines and if a student fails to meet deadlines without a genuine reason, he immediately withdraws from supervision
- Step 6: Shortlisting and Ranking 3 Potential Supervisors
- During this whole process, I found out that in the field of AI my inclination is more towards
- Machine Learning and Natural Language Processing
- I shortlisted and ranked three Potential Supervisors
- Waqas Anwar
- Rao Muhammad Adeel Nawab
- Muhammad Salman Khan
- Step 7: Finalizing the Supervisor
- I met Dr. Waqas, but he does not have space to take more students, so I cannot work under his supervision
- I met Dr. Adeel and he told me about strictly meeting the deadlines and some other rules
- I thought that I am not a great hard worker and things may get worse for me under Dr. Adeel’s supervision
- I met Dr. Salman and he told me that he is willing to supervise me
- My Final Decision in Current Situation
- I decided to do my MPhil thesis under the supervision of Dr. Salman
- Step 8: Start Working on MPhil Thesis
- I started working on MPhil thesis under the supervision of Dr. Salman
- Approaches to Solve a Real-world Problem
- Two main approaches to solve a Real-world Problem are
- Manual Approach
- Automatic Approach
- Example
- Problem
- Spam Email Detection
- Manual Approach
- A human will manually check each and every email, whether it is spam or not
- Automatic Approach
- A machine (or program) will automatically check each and every email, whether it is spam or not
- Note
- Considering Spam Email Detection Real-world Problem
- It is practically impossible to manually detect Spam Emails
- Limitations of Manual Approach
- It requires a lot of effort, time and cost
- It is not practical when we have huge amount of data
- Question
- How to overcome the limitations of Manual Approach?
- A Possible Answer
- Build Intelligent Programs (or Machine Learning Models) to automatically perform a task
- Types of Automatic Approaches
- Two main types of automatic approaches are
- Rule-based Approach
- Machine Learning Approach
- Rule-based Approach
- In this approach, rules are manually extracted by humans (Domain Experts) to build Intelligent Programs (or Machine Learning Models)
- Machine Learning Approach
- Machine Learning Approach is a.k.a. Data Driven Approach
- In this approach, rules are automatically extracted from data to build Intelligent Programs (or Machine Learning Models)
- Machine Learning Problem
- Definition
- A Real-world Problem which can be represented, analyzed and solved using Machine Learning Approach is called a Machine Learning Problem
- Purpose
- Solving Real-world Problems using Machine Learning Approach will help to improve the overall quality of various tasks
- Examples
- Some of the Real-world Problems which can be treated as Machine Learning Problems are
- Gender Identification
- Age Group Identification
- Fake News Detection
- Toxic Comment Detection
- Hate Speech Detection
- Spam Email Detection
- Machine Translation
- Text Summarization
- Sentiment Analysis
- Emotion Analysis
- Face Detection
- Face Recognition
- Object Detection from Image / Video
- Object Recognition from Image / Video
- Natural Language Description from Image
- Activity Recognition in Videos
- Speech to Text
- Text to Speech
- And many more 😊
- Real-world Problems and Machine Learning
- The main goal of Machine Learning is to develop Intelligent Programs / Models which can assist human beings in various Real-world tasks
- Question
- Can we treat every Real-world Problem as a Machine Learning Problem?
- Answer
- No
- Question
- Problem
- Not every Real-world Problem can be treated as a Machine Learning Problem. How can I identify whether my Real-world Problem can be treated as a Machine Learning Problem or not?
- Possible Solution 1
- Step 1: Completely and correctly understand the Real-world Problem
- Step 2: Develop strong understanding of Machine Learning
- Step 3: Answer the following question
- How will you transform your Real-world Problem into a Machine Learning Problem?
- Possible Solution 2
- Step 1: Completely and correctly understand the problem
- Step 2: Consult a Machine Learning Expert and ask the following question
- Can your Real-world Problem be treated as a Machine Learning Problem?
- Question
- How to check if a Real-world Problem can be treated as a Machine Learning Problem or Not?
- Answer
- Recall Chapter 2 – Basics of Machine Learning
- Majority of Machine Learning involves
- Learning Input-Output Functions
- If (Real-world Problem can be broken into Input and Output)
- Then
- You can treat that Real-world Problem as a Machine Learning Problem 😊
- Note
- In Sha Allah , this solution works for majority of Real-world Problems
TODO and Your Turn
TODO Task 1
- Task 1
- Real-world Problem
- You want to perform Umrah
- Question
- Apply Steps – How to Identify Most Suitable Solution to a Real-world Problem and design the most suitable solution to perform Umrah
- Task 2
- Real-world Problem
- You want to become a balanced and characterful personality
- Question
- Apply Steps – How to Identify Most Suitable Solution to a Real-world Problem and identify the best habits that will make you a balanced and characterful personality
- Task 3
- Consider the following Real-world Problems
- Information Retrieval
- Automatic Paraphrase Generation
- Author Region Identification
- Code Plagiarism Detection
- Fake News Detection
- Questions
- Can the above Real-world Problems be treated as Machine Learning Problems? If Yes, Explain. If No, Explain.
- If Yes, write down Input and Output for each Real-world Problem?
Your Turn Task 1
- Task 1
- Identify 5 Real-world Problems (not mentioned in this Chapter) which can be treated as Machine Learning Problems
- Question
- For each Real-world Problem
- Write Input and Output
- Task 2
- Write down a Real-world Problem
- Question
- Apply Steps – How to Identify Most Suitable Solution to a Real-world Problem to find the most suitable solution to your Real-world Problem
Corpus / Dataset
- Treating a Problem as a Machine Learning Problem
Recall the Equation
- Data = Model + Error
- This shows that Data is the backbone of Machine Learning
- That is why the Machine Learning Approach is also called the Data Driven Approach
- To develop Intelligent Programs (or Models) using Machine Learning Approach (or Data Driven Approach), we need
- Large amount of Data
- High-quality Data
- Balanced Data
- Remember the Quote
- The more you know someone, the more you love him 😊
- Corpus / Dataset – Machine Learning
- Definition
- In Machine Learning, Corpus / Dataset is defined as a collection of Real-world Data in Machine-readable Format
- Purpose
- To enable a Machine to learn from Corpus / Dataset to perform various useful tasks
- Importance
- Without Data, Machine cannot learn
- i.e., Machine Learning is futile without Data
- Without Data, Machine cannot learn
- Applications
- A range of successful Machine Learning Systems are being used around the world, which are built using Corpus / Dataset
- Google Real-world Applications
- Google Assistant
- Spell Checker
- Spam Email Detection
- Speech to Text
- Text to Speech
- Google Smart Email Reply
- and many more
- Amazon Real-world Applications
- Products Recommender Systems
- Alexa – Amazon’s Virtual Assistant
- Facebook Real-World Applications
- Face Recognition Systems
- Friend Recommender System
- Natural Language Processing (NLP)
- The starting point of NLP research is a Corpus / Dataset
- Note
- I have listed only a few applications of Corpus / Dataset
- In recent years, majority of Real-world AI / ML Applications are Data Driven
- Machine-readable Corpus vs Machine Understandable Corpus
- Machine-readable Corpus
- A corpus which a Machine can read
- Machine Understandable Corpus
- A corpus which a Learner (Machine Learning Algorithm) can use to learn
- Using Corpus for Machine Learning - Common Practice
- The common practice used to Train and Test Machine Learning Algorithms using a Corpus is as follows
- Step 1: Gold Standard (or benchmark) corpus is developed in Machine-readable Format
- Step 2: Machine-readable Corpus is transformed into Machine Understandable Corpus
- Step 3: Machine Learning Algorithms are Trained and Tested on Machine Understandable Corpus
- Types of Corpus / Dataset – Machine Learning
- The Corpus / Dataset used in Machine Learning can be mainly categorized as
- Annotated Corpus
- Unannotated Corpus
- Semi-annotated Corpus
- Annotated Corpus
- Output is associated with all Inputs
- Used for Supervised Learning
- Unannotated Corpus
- Output is not associated with Inputs
- Used for Unsupervised Learning
- Semi-annotated Corpus
- Output is associated with some Inputs
- Used for Semi-supervised Learning
- An annotated / unannotated / semi-annotated corpus can be
- Mono-lingual Corpus
- Multi-lingual Corpus
- Cross-lingual Corpus
- Mono-lingual Corpus
- Definition
- A mono-lingual corpus contains text in only one language
- Examples
- Note
- A Mono-lingual Corpus can be annotated, unannotated or semi-annotated
- Multi-lingual Corpus
- Definition
- A multi-lingual corpus contains text in more than one (or multiple) languages
- Examples
- Note
- A Multi-lingual Corpus can be annotated, unannotated or semi-annotated
- Cross-lingual Corpus
- Definition
- In Cross-lingual Corpus, Source Text is in one language and Target Text is in another language
- Example
- Consider the following Corpus for Urdu-English Machine Translation
- Note
- A Cross-lingual Corpus can be annotated, unannotated or semi-annotated
- Two Main Types of Cross-lingual Corpora
- Comparable Corpus
- Parallel Corpus
- Comparable Corpus
- Definition
- A Comparable Corpus is a collection of similar texts in different languages or in different varieties of a language
- Examples
- Wikipedia
- Contains articles on same topic in different languages
- LOB Corpus (British English)
- Kolhapur Corpus (Indian English)
- Important Note
- In a Comparable Corpus, the topic of the texts / articles will be the same; however, their content will not be the same
- Parallel Corpus
- Definition
- Parallel Corpus is a corpus that contains a collection of original texts in language L1 and their translations into a set of languages L2 … Ln
- In most cases, Parallel corpora contain data from only two languages
- Types of Parallel Corpus
- A Parallel Corpus can be
- Bi-lingual Parallel Corpus
- A Bi-lingual Parallel Corpus consists of texts of two languages
- Multi-lingual Parallel Corpus
- A Multi-lingual Parallel Corpus consists of texts of more than two languages
- Uni-directional Parallel Corpus
- A Uni-directional Parallel Corpus contains translation in only one direction
- e.g., Arabic text translated into Urdu
- Bi-directional Parallel Corpus
- A Bi-directional Parallel Corpus contains translations in both directions
- e.g., Arabic text translated into Urdu and vice versa
- Multi-directional Parallel Corpus
- A Multi-directional Parallel Corpus contains translations in multiple languages
- e.g., Arabic text (قرآن پاک) translated into Urdu, English, Persian, German, French etc.
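In code, a Parallel Corpus can be captured as aligned (source, target) sentence pairs. The sketch below is hypothetical (the sentence pairs are placeholders, not a real corpus):

```python
# A bi-lingual, uni-directional Parallel Corpus represented as
# aligned (source, target) sentence pairs. Placeholder data only.
parallel_corpus = {
    "source_language": "Urdu",
    "target_language": "English",
    "direction": "uni-directional",   # Urdu -> English only
    "pairs": [
        ("urdu sentence 1", "english translation 1"),
        ("urdu sentence 2", "english translation 2"),
    ],
}

# A bi-directional corpus would additionally store English -> Urdu
# pairs; a multi-directional one, pairs for several target languages.
for src, tgt in parallel_corpus["pairs"]:
    print(src, "->", tgt)
```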
- Example 1 – Parallel corpus
- Arabic
- Urdu Translation
اللہ ﷻ کے سوا کوئی معبود نہیں اور محمد ﷺ اللہ کے( آخری) رسول ہیں۔
- English Translation
- There is none worthy of worship except Allah, Muhammad (S.A.W.W.) is the (Last) Messenger of Allah
- Example 2 – Parallel corpus
- Arabic
- Urdu Translation
- English Translation
- I solemnly declare my belief in Allah as He is with all His names and attributes, and I have accepted (to obey) all His commands, pledging with my tongue and testifying to them with my heart
- Example 3 – Parallel corpus
- Arabic
- Urdu Translation
- English Translation
- I believe in Allah, His Angels, His Books, His Messengers, the Last Day, and the Predestination, that all good and bad is from Allah, and I believe in the resurrection after death
TODO and Your Turn
TODO Task 1
- Task 1
- Consider the following tasks
- Text Summarization
- Machine Translation
- Chatbot
- Question Answering
- Plagiarism Detection
- Question
- Which of the above-mentioned tasks can be treated in
- Mono-lingual Settings (using Mono-lingual Corpus)
- Multi-lingual Settings (using Multi-lingual Corpus)
- Cross-lingual Settings (using Cross-lingual Corpus)
- Write Input and Output for each variant of the tasks
- TIP
- Task
- Text Reuse Detection
- Variants of Task
- Mono-lingual Text Reuse Detection
- Cross-lingual Text Reuse Detection
Your Turn Task 1
- Task 1
- Identify three tasks which can be treated both in
- Mono-lingual Settings (using Mono-lingual Corpus)
- Cross-lingual Settings (using Cross-lingual Corpus)
- Questions
- Write Input and Output for each variant of the tasks
- Task 2
- Identify two tasks which can be treated in all three settings given below
- Mono-lingual Settings (using Mono-lingual Corpus)
- Multi-lingual Settings (using Multi-lingual Corpus)
- Cross-lingual Settings (using Cross-lingual Corpus)
- Questions
- Write Input and Output for each variant of the tasks
Data and Annotation
- Advantages of Data Annotation
- As discussed earlier, for more accurate learning, we need
- Annotated Data
- Data Annotation mainly
- adds value to a corpus in that it considerably extends the range of research questions that a corpus can readily address
- Note
- For details on Data Annotations
- See Chapter 2 – Basics of Machine Learning
- Example - Advantages of Data Annotation
- Consider the following Unannotated Textual Data
- Unannotated Data / Corpus
- Document 1
- A fly is sitting on the flower
- Document 2
- A bird is flying in the air
- Annotated Data / Corpus (with Part-Of-Speech (POS) Tag)
- Document 1
- A fly (Noun) is sitting on the flower
- Document 2
- A bird is flying (Verb) in the air
- Annotated Data / Corpus is useful to address following research problems
- Identifying the correct POS tag of a word
- Identify the correct meaning of an ambiguous word, e.g., fly (Word Sense Disambiguation Problem)
- Identify the correct meaning of a sentence
- and many more 😊
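The annotated corpus above can be represented as (token, POS-tag) pairs. A minimal sketch, with tags assigned by hand purely for illustration:

```python
# The two documents above, annotated as (token, POS-tag) pairs.
# Tags were written by hand here; a real corpus would be tagged
# by human annotators or a POS tagger.
annotated_corpus = {
    "Document 1": [("A", "DET"), ("fly", "NOUN"), ("is", "AUX"),
                   ("sitting", "VERB"), ("on", "ADP"),
                   ("the", "DET"), ("flower", "NOUN")],
    "Document 2": [("A", "DET"), ("bird", "NOUN"), ("is", "AUX"),
                   ("flying", "VERB"), ("in", "ADP"),
                   ("the", "DET"), ("air", "NOUN")],
}

def tags_of(word):
    """Return all POS tags attached to a given word form."""
    return [tag for doc in annotated_corpus.values()
            for tok, tag in doc if tok.lower() == word]

# With annotation, the noun and verb uses of "fly" are now separable.
print(tags_of("fly"), tags_of("flying"))
```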
- Annotated Corpus Development Issues
- The main issues in developing a Gold Standard (or benchmark) Corpus / Dataset to Train / Test Machine Learning Algorithms are
- Data Sampling
- Time Span
- Size of Corpus
- Source(s) of Data
- High-quality Data
- Balanced Data
- Data Collection
- Annotation Guidelines
- Annotators
- Data Protections
- Corpus Standardization
- 1: Data Sampling – Annotated Corpus Development Issues
- Population (N)
- Definition
- Total set of observations (or examples) for a Machine Learning Problem
- Collecting data equivalent to the size of Population will lead to perfect learning
- Sample (S)
- Definition
- Subset of observations (or examples) drawn from a Population
- Note
- The size of a Sample is always less than the size of the Population from which it is taken
- Most Important Property of a Sample
- A Sample should be a true representative of the Population
Example – Population and Sample
- Machine Learning Problem
- Gender Identification
- Population
- Set of all observations (humans) in the world
- Sample
- A set of 5000 observations (humans) drawn from Population
- A Sample should be a true representative of the Population
- if in a Population
- 60% are Females and 40% are Males
- Then a Sample (5000 observations) drawn from this Population should have
- 60% Females (3000 observations) and 40% Males (2000 observations)
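Proportional allocation as in the example above can be sketched directly; the proportions and sample size are the ones stated in the example:

```python
# Class proportions in the Population, as in the example above.
population = {"Female": 0.60, "Male": 0.40}
sample_size = 5000

# A representative sample keeps the same class proportions.
sample_counts = {cls: round(p * sample_size) for cls, p in population.items()}
print(sample_counts)
```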
Why We Need Data Sampling?
- For Perfect Learning (Ideal Situation)
- Collect all data (or observations / examples) for a Machine Learning Problem
- Problem
- Practically Not Possible
- A Possible Solution (Realistic Situation)
- Draw a Sample from the Population which should be its true representative (called a Representative Sample)
- Note
- Since ML Algorithms learn from Sample Data instead of Population Data, that is why
- They have Scope of Error
Representative Sample
- Definition
- A Representative Sample is a subset of a Population that seeks to accurately reflect the characteristics of the Population
- Example
- Machine Learning Problem
- Gender Identification of Undergrad Students studying at COMSATS University, Lahore Campus on 31-03-2020
- Population
- Total number of Undergrad students at COMSATS University, Lahore Campus
- Total = 5000
- Female = 3000 (60%)
- Male = 2000 (40%)
- Note
- Size of Population is known and finite
- A Representative Sample (using Krejcie & Morgan (1970) Formula)
- Size of Sample = 357
- Female = 214 (60%)
- Male = 143 (40%)
- Note
- For above example, three things are fixed
- Type / Genre of Students
- Area
- Time
- For above example, three things are fixed
- Total number of Undergrad students at COMSATS University, Lahore Campus
- Machine Learning Problem
- Table to Determine Sample Size for Finite Population
- Krejcie & Morgan (1970)
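The Krejcie & Morgan (1970) table comes from a closed-form formula, so the sample size can also be computed directly. A minimal sketch, assuming the standard parameters (chi-square value 3.841 for 1 degree of freedom at 95% confidence, population proportion 0.5, margin of error 0.05):

```python
import math

def krejcie_morgan(population_size, chi_sq=3.841, p=0.5, margin=0.05):
    """Sample size for a finite population (Krejcie & Morgan, 1970)."""
    numerator = chi_sq * population_size * p * (1 - p)
    denominator = margin ** 2 * (population_size - 1) + chi_sq * p * (1 - p)
    return math.ceil(numerator / denominator)

# For the 5000 undergrad students in the example above:
sample_size = krejcie_morgan(5000)   # 357
# Stratify 60% / 40% so the Sample stays a true representative of the Population:
females = round(sample_size * 0.60)  # 214
males = sample_size - females        # 143
```

The computed values match the example above (357 total, 214 female, 143 male).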
Random Sampling Technique
- The most popular and basic Data Sampling technique is Random Sampling Technique
- Definition
- A sampling technique in which each Sample member (observation / example) has an equal probability of being chosen
- Using this approach, you put all the Data in a Bag and randomly pick Sample members one by one
- A Sample chosen randomly is meant to be an unbiased representation of the total Population
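The bag-drawing idea can be sketched with Python's standard library; the population here is a hypothetical list of observation IDs:

```python
import random

random.seed(42)                            # fixed seed, so the draw is reproducible
population = list(range(1, 101))           # 100 hypothetical observations
sample = random.sample(population, k=10)   # every member has an equal chance; no repeats

assert len(sample) == 10
assert len(set(sample)) == 10              # no observation is picked twice
assert all(member in population for member in sample)
```

Note that `random.sample` draws without replacement, which matches picking members out of the bag one by one.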
Data Sampling and Quality
- To generate Annotated Corpus of high-quality
- Sample Data must be a true representative of the Population
- 2: Time Span – Annotated Corpus Development Issues
- Language changes with time
- Consequently, Data changes with time
- Determination of a particular Time Span is required to
- capture the features of a language (Data) within that time span
- Remember
- Thus, when Data changes then
- Model and Error also change
Example – How Data Changes Over Time
- Initially, Facebook only contained mono-lingual comments / posts (mostly English)
- After some time, people started giving comments / posts in different languages (English, Arabic, Urdu, French etc.)
- After some time, people started giving multi-lingual comments / posts (for e.g., English and Roman Urdu)
- Nowadays, Emojis are very popular and widely used in Facebook comments / posts
- Note – Today is 30-03-2020 😊
- 3: Size of Corpus – Annotated Corpus Development Issues
- Another very important point to consider in developing an Annotated Corpus
- Rule of Thumb
- The more experience (Data) you have on a task, the better you learn and perform
- Similarly, a Machine Learning Algorithm can be better Trained and Tested if we have large Annotated Corpus
- A corpus should be large because
- wide representation is possible within a larger corpus
- A large corpus should also have the
- scope for regular augmentation
- 4: Source(s) of Data – Annotated Corpus Development Issues
- Selection of Data Format
- In Gold Standard (or benchmark) Annotated Corpora, Data is stored in Machine-readable Format
- Therefore, try to identify those Source(s) of Data which contain Data in Machine-readable Format
- It will speed up the Annotated Corpus Creation Process
- Selection of Suitable Data Source(s)
- Selection of suitable Data Source(s) mainly depends on the
- Machine Learning Problem
- Examples
- Machine Learning Problem 1
- Gender Identification from Photo
- Machine Learning Problem 2
- Gender Identification from Customer Reviews on Products
- Machine Learning Problem 3
- Gender Identification from Speech
- Note
- Machine Learning Problem 1 requires Image Data
- Machine Learning Problem 2 requires Textual Data
- Machine Learning Problem 3 requires Speech / Audio Data
- The suitable Data Source(s) for all three Machine Learning Problems will be different
- Authenticity of Data Source(s)
- The Data Source(s) used for creating Gold Standard Annotated Corpus should be authentic , otherwise
- Annotated Corpus will not be Gold Standard
- Important Questions about Data Source(s)
- Are the targeted Data Source(s) appropriate for your proposed Annotated Corpus?
- Are the targeted Data Source(s) authentic?
- What legal and ethical requirements will you need to fulfill to collect Data?
- What potential challenges may you face in collecting Data from each Data Source?
- How much Data is needed to build a Gold Standard Annotated Corpus and how much Data do you expect to collect from potential Data Source(s)?
- What strategies (or approaches) will you use to collect Data from the Data Source(s)?
Types of Data Sources
- Two main types of Data Sources are
- Sources with Annotations
- Sources without Annotations
- Sources with Annotations
- Definition
- In this type of Data Source, the Output is present in the Data along with the Input
- Sources with Annotations can be
- Online Digital Repositories
- Non-digital Repositories
- Existing Corpora
- Sources without Annotations
- Definition
- In this type of Data Source, the Output is not present in the Data along with the Input
- Sources without Annotations can be
- Online Digital Repositories
- Non-digital Repositories
- Existing Corpora
Example 1 – Sources with Annotations
- Machine Learning Problem
- News Headline Generation for Urdu News Articles
- Input
- News Article in Urdu
- Output
- Headline
- Note
- Input is of variable length
- Output is of variable length
- Also, length of Output is very small compared to the length of Input
- Potential Data Source with Annotations
- Online Urdu Newspapers
- Reason(s)
- An online Urdu news article contains both
- News Article (Input) and
- Headline (Output)
- Note
- The potential Data Source contains both Input and Output
- Main Tasks in Developing Gold Standard Annotated Corpus
- Step 1: Extract Urdu news article from the newspaper website (either manually or automatically)
- Step 2: Separate Input and Output
- Step 3: Standardize Corpus
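Step 2 (separating Input and Output) can be sketched with the standard library, assuming a hypothetical page layout where the headline sits in an `<h1>` tag and the article body in `<p>` tags; real newspaper websites differ, so the tag names here are assumptions:

```python
from html.parser import HTMLParser

class NewsExtractor(HTMLParser):
    """Collect <h1> text as the Headline (Output) and <p> text as the Article (Input)."""
    def __init__(self):
        super().__init__()
        self.headline, self.article, self._tag = "", "", None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h1":
            self.headline += data
        elif self._tag == "p":
            self.article += data

page = "<html><h1>Sample Headline</h1><p>Sample article text.</p></html>"
extractor = NewsExtractor()
extractor.feed(page)
print(extractor.headline)  # Sample Headline
print(extractor.article)   # Sample article text.
```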
Example 2 – Sources with Annotations
- Machine Learning Problem
- Predict CGPA of Pakistani Undergrad Computer Science students for 1st semester from marks in 9th, 10th, 11th and 12th grades
- Input
- Marks in 9th Grade
- Marks in 10th Grade
- Marks in 11th Grade
- Marks in 12th Grade
- Output
- CGPA in 1st semester of Undergrad
- Range of Output Values
- [0 – 4]
- Note
- Input is Fixed (Set of 4 attributes)
- Output is Numeric and from a fixed range (Regression Problem)
- Potential Data Source with Annotations
- Department of Computer Science, COMSATS University, Lahore Campus
- Reason(s)
- COMSATS University, Lahore Campus was established in 2002 and it has Data of thousands of Undergrad Computer Science students
- Note
- The potential Data Source contains both Input and Output
- Main Tasks in Developing Gold Standard Annotated Corpus
- Step 1: Write an application to the Director of COMSATS University, Lahore Campus and request him to give Data of Computer Science students for research purposes
- Clearly explain your task and what potential benefits it will bring to the COMSATS University and society
- Request to provide Data after anonymizing it, to protect the privacy of the students
- Step 2: Extract following five Attributes from the Data provided by COMSATS University, Lahore Campus
- Marks in 9th Grade
- Marks in 10th Grade
- Marks in 11th Grade
- Marks in 12th Grade
- CGPA in 1st semester
- Step 3: Separate Input and Output
- Step 4: Standardize Corpus
Example 1 – Sources without Annotations
- Machine Learning Problem
- Emotion Analysis on Short Texts
- Input
- A Short Text
- Output
- Emotion
- Possible Output Values
- Anger, Anticipation, Disgust, Fear, Joy, Love, Optimism, Pessimism, Sadness, Surprise, Trust, Neutral (or No Emotion)
- Note
- Input is of variable length
- Output is Categorical and from a fixed set of 12 Emotions (Classification Problem)
- Potential Data Source without Annotations
- Tweets from Twitter
- Reason(s)
- The maximum limit of a Tweet is 280 characters, i.e., a short text
- A huge amount of Data is available on Twitter, which can be used for research purposes
- Note
- The potential Data Source only contains the Input
- Input
- Tweet
- Output
- Emotion is not known
- Main Tasks in Developing Gold Standard Annotated Corpus
- Step 1: Extract Tweets from Twitter (either manually or automatically)
- Step 2: Prepare Annotations Guidelines
- Step 3: Request at least 3 Annotators to manually annotate Tweets using the Annotation Guidelines
- Step 4: Compute Inter-Annotator Agreement
- Step 5: Standardize Corpus
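Step 4 (Inter-Annotator Agreement) is commonly measured with Cohen's kappa between a pair of annotators; a minimal sketch over hypothetical emotion labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from Annotators A and B for six tweets:
a = ["Joy", "Anger", "Joy", "Sadness", "Joy", "Fear"]
b = ["Joy", "Anger", "Sadness", "Sadness", "Joy", "Anger"]
print(round(cohens_kappa(a, b), 3))
```

Kappa is 1.0 for perfect agreement and about 0 when the agreement is no better than chance alone.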
Example 2 – Sources without Annotations
- Machine Learning Problem
- Machine Translation
- Input
- Source Text (in English)
- Output
- Target Text (in Urdu)
- Note
- Input is of variable length
- Output is of variable length
- Also, length of Input and length of Output are almost the same
- Potential Data Source without Annotations
- English Wikipedia
- Reason(s)
- Wikipedia contains articles on a range of topics , which are very beneficial particularly for academia
- English Wikipedia articles are free and publicly available for research purposes
- Note
- The potential Data Source only contains the Input
- Input
- Text in English Language
- Output
- Translation of English Text in Urdu Language is not known
- Main Tasks in Developing Gold Standard Annotated Corpus
- Step 1: Extract English text from Wikipedia (either manually or automatically)
- Step 2: Ask 1 – 3 Annotators to manually annotate the Input, i.e., manually translate the English Text into the Urdu language
- Step 3: Standardize Corpus
- Note
- If Manual Translation is generated by only one Annotator, then
- For one Input there will be one Output in the Gold Standard Annotated Corpus
- If Manual Translation is generated by more than one Annotator, then
- For one Input there will be multiple Outputs in the Gold Standard Annotated Corpus
Comparing Annotations for Emotion Detection and Machine Translation
- Annotations required for Emotion Detection and Machine Translation tasks are quite different
- Therefore, Annotators will need to have different skills to annotate data for Emotion Detection and Machine Translation tasks
- Conclusion
- Completely and correctly understand the annotations before selecting Annotators to annotate Data
- 5: High-quality Data – Annotated Corpus Development Issues
- Rule of Thumb
- Garbage In = Garbage Out
- Similarly, in Machine Learning
- Low Quality Data = Poor Model + High Error
- High Quality Data = Good Model + Low Error
- Important Questions to Ensure Data Quality
- Is your Sample Data a true representative of the Population?
- Is your Sample Data collected using an appropriate Sampling Technique?
- Is your Sample Data complete and correct?
- Is your Sample Data diversified, i.e., does it present all possible variations?
- 6: Balanced Data – Annotated Corpus Development Issues
- To have an unbiased and good Model, the Training Data should be balanced
- Importance of Balanced Data
- Assume Task 1 and Task 2 are of the same complexity
- On a daily basis, you give
- 7 hours to learn Task 1 and
- 1 hour to learn Task 2
- Question
- After one month, what will be your expertise on both tasks?
- Answer
- On Task 1
- You will have good expertise 😊
- On Task 2
- You will not have good expertise
- Reason
- You had more experience (Data) on Task 1 as compared to Task 2
- Similarly, in Machine Learning
- A Model trained on highly Unbalanced Data is unlikely to perform well on Real-world Data (or unseen data)
Example – Balanced Data
- Machine Learning Problem
- Urdu Text Document (News Articles) Categorization
- Domain Knowledge
- Urdu Newspapers publish articles on daily basis on the following main topics
- Business
- Science and Technology
- Politics
- Religion
- Showbiz
- Sports
- Crime and Court
- Other
- Note
- It is a Multi-class Classification Problem
- Size of Proposed Gold Standard Annotated Corpus
- 40,000 News Articles
- A Balanced Annotated Corpus will have the following Data Distribution
- Business = 5,000
- Science and Technology = 5,000
- Politics = 5,000
- Religion = 5,000
- Showbiz = 5,000
- Sports = 5,000
- Crime and Court = 5,000
- Other = 5,000
- Note
- No. of instances (News Articles) is the same for all 8 classes / categories
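The class distribution of an annotated corpus can be checked programmatically; a small sketch using hypothetical label lists:

```python
from collections import Counter

def is_balanced(labels):
    """True if every class has exactly the same number of instances."""
    counts = Counter(labels)
    return len(set(counts.values())) == 1

balanced = ["Business"] * 5000 + ["Sports"] * 5000
unbalanced = ["Business"] * 9000 + ["Sports"] * 1000
print(is_balanced(balanced))    # True
print(is_balanced(unbalanced))  # False
```

In practice, corpora are rarely perfectly balanced; printing `Counter(labels)` shows how far each class deviates from the ideal distribution above.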
- 7: Data Collection – Annotated Corpus Development Issues
- Data Collection
- Definition
- Data collection is the systematic approach to gathering Data from authentic Data Source(s)
- Purpose
- Collected Data is used to build the Gold Standard Annotated Corpus
- Importance
- Accurate Data Collection is essential to build a high-quality Annotated Corpus
Selection of Suitable Data Collection Method
- Selection of suitable Data Collection Methods mainly depends upon
- Format of available Data
- For a specific Machine Learning Problem, Real-world Data may be available in 3 formats
- Non-digital Data
- Digital Data
- Both
- Note
- Both Digital and Non-digital Data can be either
- Structured
- Unstructured or
- Semi-structured
- Main Source of Non-digital Data
- Printed (on paper)
- Books
- Images
- Magazines
- News Articles
- Research Papers etc.
- Note
- Non-digital Data is not in Machine-readable Format
- Main Source of Digital Data
- Online Digital Repositories available through World Wide Web contain huge amount of Digital Data
- Note
- Digital Data is in Machine-readable Format
Example 1 – Data Formats
- Machine Learning Problem
- Automatically Convert Poetry Written by Hazrat Allama Muhammad Iqbal (R.A.) into Digital Format
- Availability of Data
- For the above Machine Learning Problem, Data will be only available in
- Non-digital Format
Example 2 – Data Formats
- Machine Learning Problem
- Identify Ahadith Mubarakah (blessed sayings, احادیث مبارکہ) of the Holy Prophet, Hazrat Muhammad S.A.W.W. posted on Twitter
- Availability of Data
- For the above Machine Learning Problem, data will be only available in
- Digital Format
Example 3 – Data Formats
- Machine Learning Problem
- Automatically Differentiate between Poetry of Hazrat Allama Iqbal (R.A.) and Hazrat Khawaja Aziz Ul Hassan Majzoob (R.A.)
- Availability of Data
- For the above Machine Learning Problem, Data is likely to be available in both formats
- i.e., Digital and Non-digital
Collecting Non-digital Data
- Remember
- Gold Standard Annotated Corpora are stored in
- Machine-readable Format
- Steps to Collect Non-digital Data
- Step 1: Manually collect Non-digital Data from authentic Data Source(s)
- Step 2: Convert the Non-Digital Data into Digital Data
Optical Character Recognition (OCR) System
- Optical Character Recognition (OCR)
- Definition
- OCR is an Intelligent Program that automatically recognizes Text within a Digital Image
- Input to OCR
- A Digital Image containing Text
- Output generated by OCR
- A Digital Text Document
- OCR is an application of Machine Learning 😊
- None of the OCR systems are 100% accurate
- Therefore, we need to manually inspect the document after applying OCR, to remove error(s) generated by OCR
- If a highly accurate OCR is available for a particular language, then
- it should be used to quickly and easily convert Images containing Text into Digital Format
- Errors generated by OCR, can be removed through manual inspection
Converting Non-digital Data into Digital Data
- Considering that Non-digital Data is in the form of Text
- Situation 1
- Non-digital Data (Text) is printed on a paper
- Conversion Approach
- A human will use a keyboard to type the Text in Digital Format
- Situation 2
- Non-digital Data (Text) is in the form of Images
- Conversion Approach 1 – Manual Approach
- A human will use a keyboard to type the Text contained in the Image in Digital Format
- Strengths
- Manual Approach is accurate
- Weaknesses
- Manual Approach is slow (requires a lot of time, cost and effort)
- Conversion Approach 2 – Semi-automatic Approach
- A Two Step Process
- Step 1: Apply an accurate Optical Character Recognition (OCR) System on Image and OCR will automatically convert it into Digital Data
- Step 2: Manually inspect the Digital Data returned by OCR to fix the errors (generated by OCR)
- Strengths
- Semi-automatic Approach is accurate
- Semi-automatic Approach is fast compared to Manual Approach
- Weaknesses
- Semi-automatic Approach is still slow (and requires time, cost and effort) because it requires manual inspection to correct errors generated by OCR
- Conversion Approach 3 – Automatic Approach
- Apply an accurate Optical Character Recognition (OCR) System on Image and OCR will automatically convert it into Digital Data
- Strengths
- Automatic Approach is fast
- Weaknesses
- Automatic Approach is not very accurate
Selection of Approach to Solve a Real-world Problem
- To solve any Real-world Problem, we have three main approaches
- Manual Approach
- only involves Human
- Semi-automatic Approach
- involves both Machine and Human
- Automatic Approach
- only involves Machine
- It is important to select the most suitable approach to solve a Real-world Problem
Collecting Digital Data
- Data in Digital Format is mainly collected using
- Semi-automatic Approaches
- Automatic Approaches
- Selection of suitable Data Collection Approach depends on the targeted Data Source(s)
- A Data Source may contain
- Data can be accessed freely and publicly
- Data that is free and publicly available to create Gold Standard Annotated Corpus for research purposes
- For e.g. Wikipedia, News Articles etc.
- Data can be viewed freely and publicly
- However, to use Data for creating Gold Standard Annotated Corpus, you will need permission and / or access from the Owner
- For e.g. Facebook, Twitter etc.
- Data is not publicly available
- Data is owned by an individual and to use the Data to create Gold Standard Annotated Corpus for research purposes you will need to
- Take permission from the individual and
- Anonymize the identity of the individual
- for e.g., SMS messages in a person’s mobile phone
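For the Private Data case, anonymization can be sketched with a one-way hash from the standard library; the record fields here are hypothetical:

```python
import hashlib

def anonymize(record):
    """Replace the sender's identity with a one-way hash; keep only the message text."""
    digest = hashlib.sha256(record["sender"].encode("utf-8")).hexdigest()
    return {"sender_id": digest[:12], "text": record["text"]}

sms = {"sender": "+92-300-1234567", "text": "See you at 5pm"}
anon = anonymize(sms)

assert "sender" not in anon         # the phone number is gone
assert anon["text"] == sms["text"]  # the research-relevant content survives
```

The same sender always maps to the same pseudonym, so messages from one contributor can still be grouped without revealing who the contributor is.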
Examples – Collecting Digital Data
- Task 1
- Create a Gold Standard Annotated Corpus using Wikipedia
- A Possible Data Collection Approach
- Step 1: Download the latest dump of Wikipedia
- Step 2: Extract the required Data from the Wikipedia Dump using a Wikipedia Dump Parser
- Note
- A Wikipedia Dump Parser in Python can be found from the following link
- URL: https://pypi.org/project/wiki-dump-parser/
- Task 2
- Create a Gold Standard Annotated Corpus using Twitter / Facebook Data
- A Possible Data Collection Approach
- Step 1: Take permission from Twitter / Facebook to collect publicly available Data from their website
- Step 2: Use Twitter / Facebook API to collect your required Data from their website
- Task 3
- Create a Gold Standard Annotated Corpus using News Articles
- A Possible Data Collection Approach
- Step 1: Write a Web Crawler
- Step 2: Extract required Data from news websites using the Web Crawler developed in Step 1
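The link-following half of such a Web Crawler can be sketched with the standard library; the HTML snippet and site URL are hypothetical, and a real crawler would also need politeness rules (rate limiting, respecting robots.txt):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags to build the crawl frontier."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

page = '<a href="/sports/a1.html">Sports</a> <a href="/politics/a2.html">Politics</a>'
extractor = LinkExtractor("https://example-news.com/")
extractor.feed(page)
print(extractor.links)
# ['https://example-news.com/sports/a1.html', 'https://example-news.com/politics/a2.html']
```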
- Task 4
- Create a Gold Standard Annotated Corpus using SMS messages
- A Possible Data Collection Approach
- Step 1: Meet potential contributors and explain the purpose of your research to them
- Build their trust on you and ensure them that their privacy will be protected
- Step 2: Take consent (both verbal and written) from the contributors that you can use their Data for research purposes
- Step 3: Ask contributors to share their SMS messages (Data) with you through email or a Google Form
- Note
- For all four Tasks discussed above
- The Potential Data Sources and Tools / Techniques used to collect Data are different
- Conclusion
- Before starting Data Collection, you must have complete and correct understanding of
- Targeted Data Sources and
- Tools / Techniques to be used to collect Data from Targeted Data Sources
- 8: Annotation Guidelines – Annotated Corpus Development Issues
- Definition
- The set of rules / guidelines that help an Annotator to assign the most appropriate Output to an Input
- Importance
- Annotation Guidelines are the backbone of Annotation Process
- Low-quality or inappropriate Annotation Guidelines will result in a low-quality Annotated Corpus and vice versa
- 9: Annotators – Annotated Corpus Development Issues
- Two Main Situations
- Annotations are Present in Raw Data
- Annotations are Not Present in Raw Data
Annotations are Present in Raw Data
- When annotations are present in the Raw Data, then we
- Don’t need Annotators / Taggers
- Example
- Machine Learning Problem
- Automatically Generate Headline of an Urdu News Article
- Input
- An Urdu News Article
- Output
- Headline
- Availability of Raw Data
- Table below shows an example of Raw Data (an Urdu News Article (Input) with Headline (Output))
- Headline (Output): ساؤتھ ایشین گیمز ہاکی مقابلے میں پاکستانی ٹیم نے بھارت کو شکست دیدی
- News Article (Input): ساؤتھ ایشین گیمز کے ہاکی ایونٹ میں پاکستانی ٹیم ابتدا سے ہی جارحانہ انداز میں نظر آئی اور بھارت خلاف پے درپے حملوں کا سلسلہ جاری رکھا جس کی وجہ سے پہلے ہی ہاف میں پاکستان نے بھارت کے خلاف 2 گول داغ کر برتی حاصل کرلی ۔ پاکستان کی جانب سے پہلا گول فرید احمد جب کہ دوسرا ارسلان قادر نے کیا ۔ بھارتی ٹیم دوسرے ہاف میں میچ بچانے اور مقابلہ برابر کرنے کی بھرپور کوششوں میں مصروف رہی لیکن پاکستانی ہاکی ٹیم کے دفاعی کھلاڑیوں نے حریف ٹیم کے تمام حملے ناکام بنادیے تاہم بھارتی ٹیم صرف ایک ہی گول کرپائی.
Annotations are Not Present in Raw Data
- When annotations are not present in the Raw Data, then we
- Need Annotators / Taggers to manually annotate Raw Data
- Since manual annotation is a laborious and time-consuming task
- Finding appropriate Domain Experts is a challenging task
- Standard Practice for Annotations
- Use at least 3 Annotators to annotate Data according to Annotation Guidelines
- All Annotators must be Domain Experts
- Two Main Variations in Annotations
- When Output is of
- Fixed Length (from a Fixed Set of Values )
- When Output is of
- Variable Length
Example – Annotations When Output is of Fixed Length
- Machine Learning Problem
- Emotion Analysis on Tweets in English Language
- Input
- Tweet
- Output
- Emotion
- Possible Output Values (12 Categories / Classes)
- Anger, Anticipation, Disgust, Fear, Joy, Love, Optimism, Pessimism, Sadness, Surprise, Trust, Neutral (or No Emotion)
- Availability of Raw Data (Tweets) from Twitter
- I collected the following Tweets from Twitter to develop my Gold Standard Annotated Corpus
- Star trek online has a update to download oh fuming yay
- It’s basically a dead skin peel which sounds grim. But it literally gets rid of so much dead skin from your pores.
- These #NewEnglandPatriots jerseys look like some whack ones that you tried to make when you made custom jerseys on #madden
- @WaterboysAS I would never strategically vote for someone I don’t agree with. A lot of the Clinton vote based on fear and negativity.
- Solid starts by #Bozzelli and #BenEvans. Hoping for a good #start!
- To add annotations to Tweets, we need at least 3 Annotators (A, B and C)
- Annotators A and B will annotate the Data and
- Annotator C will do the Conflict Resolution
- In Gold Standard Annotated Corpus
- Each Tweet (Input) will have one single Emotion (Output)
- Note
- In Sha Allah, in the next Chapter, I will try to explain the Annotation Process in detail
Example – Annotations When Output is of Variable Length
- Machine Learning Problem
- Machine Translation from Urdu to English
- Input
- Source Text in Urdu
- Output
- Target Text in English
- Availability of Raw Data
- Input 1: دنیا کی محبت تمام خطاؤں کی جڑ ہے
- Input 2: دنیا امتحان کی جگہ ہے اطمینان کی جگہ نہیں ہے
- Input 3: یہ زندگی ہے مختصر لیکن ہے بہت قیمتی
- Following the Standard Practice for Annotations
- Three Annotators annotated each Input with Output
- Annotations
- Input 1
- Outputs
- Annotator 1
- The love of the world is the root of all sins
- Annotator 2
- The love of the world is the main root cause of all mistakes
- Annotator 3
- The love of the world is the root cause of all disasters
- Input 2
- Outputs
- Annotator 1
- The world is a place of examination, it is not a place of satisfaction
- Annotator 2
- This world is a place of examination, not a place of comfort
- Annotator 3
- The world is a place of examination, it is not a place of peace
- Input 3
- Outputs
- Annotator 1
- This life is short but very valuable
- Annotator 2
- This life is short but it is very precious
- Annotator 3
- This life is short but it is very treasured
- In Gold Standard Annotated Corpus
- Each Urdu Text (Input) will have three English Translations (Outputs)
- Note
- In Sha Allah, in the next Chapter, I will try to explain the Annotation Process in detail
- 10: Data Protection - Annotated Corpus Development Issues
- Data Protection
- Definition
- Data Protection is the process of safeguarding important information from corruption, compromise or loss
- Purpose
- Safeguard important information from misuse
- Importance
- In recent years, the amount of data created and stored continues to grow at unprecedented rates, which makes it very important to safeguard important information
Data Protection and Annotated Corpus Development
- Important Note
- I am discussing Annotated Corpus Development for research purposes only
- Data used for developing a Gold Standard Annotated Corpus can be
- Public Data
- Private Data
- Both
- Public Data
- To use this type of Data, we don’t need to take explicit permission from the Owner of Data
- Example
- News Articles Published on World Wide Web
- Public Tweets on Twitter
- Public Comments / Posts on Facebook
- Free and Publicly Available Online Digital Repositories (for e.g., Wikipedia)
- Private Data
- To use this type of data, we need to
- take explicit permission from the Owner of Data to use his / her Data for research purposes
- anonymize participants / contributors personal information during Data Collection
- Example
- SMS messages
- Private Tweets on Twitter
- Private Comments / Posts on Facebook
- Websites which Publish Copyrighted Material on the World Wide Web
- Example – Corpus containing both Private and Public Data
- Author Profiles containing both private and public comments / posts of a Facebook user
- See the following paper for detail
- Fatima, K. Hasan, S. Anwar and R. Nawab (2017), Multi-lingual Author Profiling on Facebook, Information Processing & Management, Elsevier
License and Annotated Corpus Development
- Corpus License
- Definition
- A Corpus License means the terms and conditions for use , reproduction , and distribution of the corpus
- Purpose
- To clearly define, how a corpus can be used
- Making an Annotated Corpus Publicly Available
- Simply write the following statement, in Footnote of your research paper
- Our proposed corpus is free and publicly available for research purposes
- License Agreement Details
- Please see the following useful link
- URL: https://creativecommons.org/licenses/
- 11: Corpus Standardization – Annotated Corpus Development Issues
- For Machine Learning, a corpus should be standardized in
- Machine-readable Format
- Three Main Formats
- Plain Text Format
- CSV Format
- XML Format
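Writing an annotated corpus out in CSV Format can be sketched with the standard library; the column names and example rows are hypothetical:

```python
import csv
import io

# Hypothetical annotated instances: (Input text, Output label)
rows = [
    ("this phone has a brilliant camera", "Positive"),
    ("battery drains far too quickly", "Negative"),
]

buffer = io.StringIO()              # stands in for an open corpus file
writer = csv.writer(buffer)
writer.writerow(["text", "label"])  # header describing the columns
writer.writerows(rows)
corpus_csv = buffer.getvalue()
print(corpus_csv)
```

The csv module handles quoting automatically, so Inputs containing commas or quotes survive a round trip; reading the corpus back with `csv.reader` restores the original rows.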
- Main characteristics of a Gold Standard Annotated Corpus are as follows
- Sample Data should be a true representative of the Population
- A Representative Sample should have
- Large Amount of Data
- High-quality Data
- Balanced Data
- Data Sources should be
- Appropriate and
- Authentic
- For Data Collection, all legal and ethical requirements should be fulfilled
- Standard Tools / Techniques should be used for Data Collection
- Corpus Generation Process must be standard and state-of-the-art
- There should be at least three Annotators
- Annotators must be Domain Experts
- Annotation Guidelines should clearly describe the annotation criteria
- Inter-Annotator Agreement and Kappa Statistics must be good
- Corpus must be standardized properly using Standard Formats
- License of Corpus should be clearly mentioned
- Corpus Characteristics should be clearly mentioned
TODO and Your Turn
TODO Task 3
- Task 1
- Qamar wants to develop a Gold Standard Annotated Corpus for the Sentiment Analysis task. His Research Focus is on Urdu Comments / Reviews on Products. The problem of Urdu Sentiment Analysis is treated as a Multi-class Classification Problem, i.e., there are three classes: Positive, Negative and Neutral. While developing the Gold Standard Annotated Corpus, Qamar is unable to find answers to the following questions. Your task is to find the answers and share them with Qamar
- Note
- Each Answer must be
- Well Justified
- Questions
- Data Sampling
- How should Qamar collect a Representative Sample if the Population size is 100K?
- What Sampling Technique will be most suitable?
- Time Span
- What will be a good Time Span to collect Sample Data?
- Size of Corpus
- How much Data will be good enough to Train / Test Machine Learning Algorithms?
- Source(s) of Data
- Write down potential Data Sources
- How will you ensure the authenticity of the Data Sources?
- Why do you think that your selected Data Sources are appropriate for Urdu Sentiment Analysis task?
- High-quality Data
- What steps will you follow to ensure quality in Data?
- What techniques will you use to bring quality in your Data?
- Balanced Data
- How will you ensure that Data is balanced?
- Data Collection
- What Tools / Techniques will you use for Data Collection?
- Annotation Guidelines
- How will you prepare the Annotation Guidelines?
- Write down Annotation Guidelines for Urdu Sentiment Analysis
- TIP
- To prepare Annotation Guidelines for Urdu Sentiment Analysis task, see Annotation Guidelines for Sentiment Analysis in existing research papers
- For Example, see the following paper
- Mahmood, I. Safder, R. Nawab, F. Bukhari, S. Alelyani, S. Hassan, N. Aljohani, R. Nawaz (2020), Deep Sentiments in Roman Urdu Text using Recurrent Convolutional Neural Network Model, Information Processing & Management, Elsevier
- Annotators
- How many Annotators should be used to annotate data?
- What should be the main characteristics of Annotators?
- Explain how Inter-Annotator Agreement and Kappa Statistics will be calculated.
- Data Protections
- What License will be most appropriate to release the Gold Standard Annotated Corpus?
- Corpus Standardization
- What format will be most suitable to standardize the Gold Standard Annotated Corpus?
- Main Characteristics of a Gold Standard Annotated Corpus
- Will your solutions for the various Corpus Development Issues result in a Gold Standard Annotated Corpus?
- TIP
- Make a Table and check whether the proposed Urdu Sentiment Analysis Corpus will contain all the characteristics of a Gold Standard Annotated Corpus or not
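The questions above ask how Inter-Annotator Agreement and Kappa Statistics will be calculated. As one possible sketch (not the book's prescribed method), Cohen's Kappa for two annotators can be computed in pure Python; the annotation lists below are invented purely for illustration:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels assigned by two annotators to ten Urdu reviews.
ann1 = ["Pos", "Pos", "Neg", "Neu", "Pos", "Neg", "Neg", "Neu", "Pos", "Neg"]
ann2 = ["Pos", "Neg", "Neg", "Neu", "Pos", "Neg", "Pos", "Neu", "Pos", "Neg"]
print(cohen_kappa(ann1, ann2))  # kappa ≈ 0.69, i.e. substantial agreement
```

For three or more annotators, Fleiss' Kappa is the usual generalization of this statistic.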
Your Turn Task
- Task 1
- Identify a Real-world Problem for which you want to develop a Gold Standard Annotated Corpus, and answer the questions given below.
- Note
- Each Answer must be
- Well Justified
- Questions
- Data Sampling
- How will you collect a Representative Sample from the Population of your selected Problem?
- What Sampling Technique will be most suitable?
- Time Span
- What will be a good Time Span to collect Sample Data?
- Size of Corpus
- How much Data will be good enough to Train / Test Machine Learning Algorithms?
- Source(s) of Data
- Write down potential Data Sources?
- How will you ensure the authenticity of the Data Sources?
- Why do you think that your selected Data Sources are appropriate for your selected task?
- High-quality Data
- What steps will you follow to ensure quality in the Data?
- What techniques will you use to improve the quality of your Data?
- Balanced Data
- How will you ensure that Data is balanced?
- Data Collection
- What Tools / Techniques will you use for Data Collection?
- Annotation Guidelines
- How will you prepare the Annotation Guidelines?
- Write down Annotation Guidelines for your selected task.
- TIP
- To prepare Annotation Guidelines for your selected task, see Annotation Guidelines for similar tasks in existing research papers
- For Example, see the following paper
- Mahmood, I. Safder, R. Nawab, F. Bukhari, S. Alelyani, S. Hassan, N. Aljohani, R. Nawaz (2020), Deep Sentiments in Roman Urdu Text using Recurrent Convolutional Neural Network Model, Information Processing & Management, Elsevier
- Annotators
- How many Annotators should be used to annotate data?
- What should be the main characteristics of Annotators?
- Explain how Inter-Annotator Agreement and Kappa Statistics will be calculated.
- Data Protections
- What License will be most appropriate to release the Gold Standard Annotated Corpus?
- Corpus Standardization
- What format will be most suitable to standardize the Gold Standard Annotated Corpus?
- Main Characteristics of a Gold Standard Annotated Corpus
- Will your solutions for the various Corpus Development Issues result in a Gold Standard Annotated Corpus?
- TIP
- Make a Table and check whether your proposed Corpus will contain all the characteristics of a Gold Standard Annotated Corpus or not
Chapter Summary
- Chapter Summary
In this Chapter, I presented the following main concepts:
- A Real-world Problem is defined as a matter or situation regarded as unwelcome or harmful and needing to be dealt with and overcome
- To systematically identify the most suitable Solution to a Real-world Problem, use the following Step by Step approach
- Step 1: Write down the Real-world Problem and your current circumstances
- Step 2: Completely and correctly understand the situation
- Step 3: List down the Possible Solutions that you know
- Step 4: Consult Domain Experts and people who faced similar Real-world Problem in the past and update your List of Possible Solutions
- Step 5: Write down strengths and weaknesses of each Possible Solution
- Step 6: Shortlist and rank the 3 Possible Solutions that seem to be most suitable
- Step 7: Consult a Domain Expert and select the one which seems to be most suitable to solve the Real-world Problem in your current situation
- Step 8: Apply the selected Solution to solve your Real-world Problem
- If (Real-world Problem can be broken into Input and Output)
- Then
- You can treat that Real-world Problem as a Machine Learning Problem 😊
- Data = Model + Error
- This shows that Data is the backbone of Machine Learning
- That is why the Machine Learning Approach is also called the Data-Driven Approach
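The equation Data = Model + Error can be illustrated with a toy numeric example (all numbers invented): fit the simplest possible model, the mean of the data, and the Error is whatever the Model fails to capture:

```python
# Toy illustration of Data = Model + Error (numbers are made up).
data = [2.0, 4.0, 6.0, 8.0]

# A very simple "model": predict the mean of the data for every point.
mean = sum(data) / len(data)          # 5.0
model = [mean] * len(data)

# Error is the part of the Data that the Model fails to capture.
error = [d - m for d, m in zip(data, model)]

for d, m, e in zip(data, model, error):
    assert d == m + e                 # Data = Model + Error holds per point
print(model, error)                   # → [5.0, 5.0, 5.0, 5.0] [-3.0, -1.0, 1.0, 3.0]
```

A better model (e.g., a fitted line) would shrink the error terms, but some error always remains, which is why the book stresses the Scope of Error.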
- To develop Intelligent Programs (or Models) using Machine Learning Approach (or Data Driven Approach), we need
- Large amount of Data
- High-quality Data
- Balanced Data
- Corpus / Dataset is defined as a collection of Real-world Data in Machine-readable Format
- Machine-readable Corpus is
- a corpus which a Machine can read
- Machine Understandable Corpus is
- a corpus which a Learner (Machine Learning Algorithm) can use to learn
- The common practice used to Train and Test Machine Learning Algorithms using a Corpus is as follows
- Step 1: Gold Standard (or benchmark) corpus is developed in Machine-readable Format
- Step 2: Machine-readable Corpus is transformed into Machine Understandable Corpus
- Step 3: Machine Learning Algorithms are Trained and Tested on Machine Understandable Corpus
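Step 2 above can be sketched in Python. Assuming a tiny invented corpus of (text, label) pairs, a bag-of-words encoding turns Machine-readable rows into the numeric vectors a learner can consume; all names and data here are illustrative, not the book's prescribed pipeline:

```python
# Sketch of Step 2: Machine-readable corpus (text + label rows) into a
# Machine Understandable corpus (numeric vectors). Corpus is invented.
corpus = [
    ("good product fast delivery", "Positive"),
    ("bad product slow delivery", "Negative"),
    ("product arrived", "Neutral"),
]

# Build a fixed vocabulary index from every word seen in the corpus.
vocab = sorted({w for text, _ in corpus for w in text.split()})
word_id = {w: i for i, w in enumerate(vocab)}
label_id = {"Positive": 0, "Negative": 1, "Neutral": 2}

def vectorize(text):
    """Bag-of-words count vector over the fixed vocabulary."""
    vec = [0] * len(vocab)
    for w in text.split():
        vec[word_id[w]] += 1
    return vec

# X (inputs) and y (outputs) are what a Machine Learning Algorithm trains on.
X = [vectorize(text) for text, _ in corpus]
y = [label_id[label] for _, label in corpus]
print(len(vocab), X[0], y)
```

Step 3 would then pass X and y to any supervised learning algorithm.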
- The Corpus / Dataset used in Machine Learning can be mainly categorized as
- Annotated Corpus
- Output is associated with all Inputs
- Used for Supervised Learning
- Unannotated Corpus
- Output is not associated with Inputs
- Used for Unsupervised Learning
- Semi-annotated Corpus
- Output is associated with some Inputs
- Used for Semi-supervised Learning
- An annotated / unannotated / semi-annotated corpus can be
- Mono-lingual Corpus
- Text comprises only one language
- Multi-lingual Corpus
- Text comprises more than one language
- Cross-lingual Corpus
- Comparable Corpus
- A Comparable Corpus is a collection of similar texts in different languages or in different varieties of a language
- Parallel Corpus
- Parallel Corpus is a corpus that contains a collection of original texts in language L1 and their translations into a set of languages L2 … Ln
- In a Comparable Corpus, the topic of the texts / articles will be the same; however, their content will not be the same
- In most cases, Parallel corpora contain data from only two languages
- A Parallel Corpus can be
- Bi-lingual Parallel Corpus
- A Bi-lingual Parallel Corpus consists of texts of two languages
- Multi-lingual Parallel Corpus
- A Multi-lingual Parallel Corpus consists of texts of more than two languages
- Uni-directional Parallel Corpus
- A Uni-directional Parallel Corpus contains translation in only one direction
- e.g., Arabic text translated into Urdu
- Bi-directional Parallel Corpus
- A Bi-directional Parallel Corpus contains translations in both directions
- e.g., Arabic text translated into Urdu and vice versa
- Multi-directional Parallel Corpus
- A Multi-directional Parallel Corpus contains translations in multiple languages
- e.g., Arabic text (Quran.e.Pak) translated into Urdu, English, Persian, German, French etc.
- For more accurate learning, we need to have
- Annotated Data
- Data Annotation mainly
- adds value to a corpus in that it considerably extends the range of research questions that a corpus can readily address
- The main issues in developing a Gold Standard (or benchmark) Corpus / Dataset to Train / Test Machine Learning Algorithms are
- Data Sampling
- Time Span
- Size of Corpus
- Source(s) of Data
- High-quality Data
- Balanced Data
- Data Collection
- Annotation Guidelines
- Annotators
- Data Protections
- Corpus Standardization
- Population (N)
- Total set of observations (or examples) for a Machine Learning Problem
- Collecting Data equivalent to the size of the Population will lead to perfect learning
- Sample (S)
- Subset of observations (or examples) drawn from a Population
- The size of a Sample is always less than the size of the Population from which it is taken
- Most Important Property of a Sample
- A Sample should be true representative of the Population
- A Representative Sample is a subset of a Population that seeks to accurately reflect the characteristics of the Population
- To generate Annotated Corpus of high-quality
- Sample Data must be true representative of the Population
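One common way to obtain a representative sample is stratified sampling: draw the same fraction from every stratum (here, the class label) so the sample mirrors the Population's composition. The sketch below uses an invented population and is one option, not the only valid sampling technique:

```python
import random
from collections import defaultdict

def stratified_sample(population, key, fraction, seed=0):
    """Sample the same fraction from every stratum so the sample
    mirrors the population's composition."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for items in strata.values():
        k = max(1, round(fraction * len(items)))  # at least 1 per stratum
        sample.extend(rng.sample(items, k))
    return sample

# Hypothetical population: 100 reviews, 60% Pos / 30% Neg / 10% Neu.
population = [("text", "Pos")] * 60 + [("text", "Neg")] * 30 + [("text", "Neu")] * 10
sample = stratified_sample(population, key=lambda r: r[1], fraction=0.1)
print(len(sample))  # → 10 (6 Pos + 3 Neg + 1 Neu, same proportions as the Population)
```

Simple random sampling is the other common choice; stratification is preferred when you must preserve class proportions in the sample.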
- Language changes with time
- Consequently, Data changes with time
- Determination of a particular Time Span is required to
- capture the features of a language (Data) within that time span
- Remember
- Data = Model + Error
- Thus, when Data changes then
- Model and Error also change
- A Machine Learning Algorithm can be better Trained and Tested if we have large Annotated Corpus
- A corpus should be large because
- wide representation is possible within a larger corpus
- A large corpus should also have the
- scope for regular augmentation
- Data Source(s) used for creating Gold Standard Annotated Corpus should be authentic, appropriate and preferably in Digital Format.
- Two main types of Data Sources are
- Sources with Annotations
- In this type of Data Sources, the Output is present in the Data along with the Input
- Sources without Annotations
- In this type of Data Sources, the Output is not present in the Data along with the Input
- Sources with / without Annotations can be
- Online Digital Repositories
- Non-digital Repositories
- Existing Corpora
- Skill set required for annotations varies from ML Problem to ML Problem
- Completely and correctly understand the annotations before selecting Annotators to annotate Data
- In Machine Learning
- Low Quality Data = Poor Model + High Error
- High Quality Data = Good Model + Low Error
- Sample Data should be complete, correct, diversified and a true representative of the Population
- To have an unbiased and good Model, the Training Data should be balanced
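A quick sketch of checking and enforcing balance (label counts are invented; down-sampling to the rarest class is one simple strategy, at the cost of discarding data):

```python
import random
from collections import Counter

# Hypothetical annotated corpus: the label distribution is heavily skewed.
labels = ["Pos"] * 80 + ["Neg"] * 15 + ["Neu"] * 5

counts = Counter(labels)
print(counts)  # class distribution before balancing

# One simple balancing strategy: down-sample every class to the rarest class size.
rng = random.Random(0)
target = min(counts.values())                     # size of the rarest class (5)
balanced = []
for cls in counts:
    members = [lab for lab in labels if lab == cls]
    balanced.extend(rng.sample(members, target))

print(Counter(balanced))  # every class now has the same number of examples
```

Alternatives include over-sampling minority classes or collecting more minority-class data, which avoids throwing examples away.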
- Data collection is the systematic approach to gathering Data from authentic Data Source(s)
- Accurate Data Collection is essential to build a high-quality Annotated Corpus
- Three main approaches to Data Collection are
- Manual Approach
- only involves Human
- Semi-automatic Approach
- involves both Machine and Human
- Automatic Approach
- only involves Machine
- Data in Digital Format is mainly collected using
- Semi-automatic Approaches
- Automatic Approaches
- Depending on the Data Source, access to the Data may vary
- Data can be accessed freely and publicly
- Data can be viewed freely and publicly
- Data is not publicly available
- Annotation Guidelines are the set of rules / guidelines that help an Annotator assign the most appropriate Output to an Input
- Annotation Guidelines are the backbone of Annotation Process
- Low-quality or inappropriate Annotation Guidelines will result in a low-quality Annotated Corpus and vice versa
- Two Main Situations in Data Annotations
- Annotations are Present in Raw Data
- Don’t need Annotators / Taggers
- Annotations are Not Present in Raw Data
- Need Annotators / Taggers to manually annotate Raw Data
- Standard Practice for Annotations
- Use at least 3 Annotators to annotate Data according to Annotation Guidelines
- All Annotators must be Domain Experts
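When at least three annotators label the same data, a common (though not the only) way to derive the gold label is majority voting, escalating ties to an expert. A minimal sketch with invented annotations:

```python
from collections import Counter

def majority_label(votes):
    """Gold label = label chosen by most annotators; None signals
    a tie that needs expert adjudication."""
    (top, top_n), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_n:
        return None  # tie between labels — escalate to a Domain Expert
    return top

# Hypothetical labels from three annotators for four items.
annotations = [
    ("Pos", "Pos", "Neg"),   # majority → Pos
    ("Neg", "Neg", "Neg"),   # unanimous → Neg
    ("Pos", "Neg", "Neu"),   # three-way tie → None (adjudicate)
    ("Neu", "Neu", "Pos"),   # majority → Neu
]
gold = [majority_label(v) for v in annotations]
print(gold)  # → ['Pos', 'Neg', None, 'Neu']
```

Using an odd number of annotators (3, 5, ...) keeps two-way ties impossible for binary tasks, though multi-class tasks can still tie, as above.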
- Data Protection is the process of safeguarding important information from corruption, compromise or loss
- Data used for developing a Gold Standard Annotated Corpus can be
- Public Data
- To use this type of Data, we don’t need to take explicit permission from the Owner of Data
- Private Data
- To use this type of data, we need to take explicit permission from the Owner of Data to use his / her Data for research purposes
- Both
- A Corpus License means the terms and conditions for use, reproduction, and distribution of the corpus
- For Machine Learning, a corpus should be standardized in
- Machine-readable Format
- Three Main Formats
- Plain Text Format
- CSV Format
- XML Format
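Two invented annotated examples can be standardized in the CSV and XML Formats using only the Python standard library; this is a sketch of the formats, not a prescribed schema:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical annotated examples to be standardized.
rows = [("great phone", "Positive"), ("battery died", "Negative")]

# --- CSV Format: one header row, then one example per row ---
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["text", "label"])
writer.writerows(rows)
csv_corpus = buf.getvalue()

# --- XML Format: a root element with one <item> per example ---
root = ET.Element("corpus")
for text, label in rows:
    item = ET.SubElement(root, "item", label=label)
    item.text = text
xml_corpus = ET.tostring(root, encoding="unicode")

print(csv_corpus)
print(xml_corpus)
```

CSV is compact and spreadsheet-friendly; XML carries richer structure (nested annotations, attributes) at the cost of verbosity. Plain Text works when there is no annotation to encode.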
- Main characteristics of a Gold Standard Annotated Corpus are as follows
- Sample Data should be true representative of Population
- A Representative Sample should have
- Large Amount of Data
- High-quality Data
- Balanced Data
- Data Sources should be
- Appropriate and
- Authentic
- For Data Collection, all legal and ethical requirements should be fulfilled
- Standard Tools / Techniques should be used for Data Collection
- Corpus Generation Process must be standard and state-of-the-art
- There should be at least three Annotators
- Annotators must be Domain Experts
- Annotation Guidelines should clearly describe the annotation criteria
- Inter-Annotator Agreement and Kappa Statistics must be good
- Corpus must be standardized properly using Standard Formats
- License of Corpus should be clearly mentioned
- Corpus Characteristics should be clearly mentioned
In Next Chapter
- In Next Chapter
- In Sha Allah, in the next Chapter, I will present a detailed discussion on
- Data and Annotation – Step by Step Examples