1. Problem Statement
- Emails relevant to bookkeeping (invoices, statements, etc.)
- Emails relevant to servers (server down, monthly uptime report)
- Spam or irrelevant emails (Viagra, watches, lottery). Mailbox spam tools catch most of these, but some always get through.
- Emails that can wait (news, magazines, etc.; nice to read but not essential to productive work)
- Emails that should be associated with a task, project, or client
- As per Email as a first-class citizen: "Instead of one gigantic mail store, we should have a number of smaller ones that make sense to one's workflow (ex.: around projects, tasks, clients, etc.) and that can easily be shared and prioritized."
2. Related Work Analysis
Platform | Algorithm
NextCloud | Gaussian Naive Bayes
Google | Logistic Regression and Neural Networks
Outlook | ?
Yahoo | ?
2.1. Findings So Far
- Many research papers suggest that Support Vector Machines (SVM) outperform other ML algorithms for email classification. However, the leading platforms have not adopted them. More research is required, focusing on:
- Data set to be used
- Supervised or Unsupervised Learning (I think Supervised is better)
- Tiki compatibility
- SVM algorithm had the longest training time
- SVM algorithm with optimized parameters had the highest accuracy score
- Naive Bayes algorithm had the quickest predicting time
- Important points to keep in mind
- NextCloud claims to use only local data in order to protect users' privacy. Investigate how this is done, and why.
- Reference links:
- https://nextcloud.com/blog/nextcloud-mail-introduces-machine-learning-for-priority-inbox/
- https://www.sciencedirect.com/science/article/pii/S2405844018353404
- https://towardsdatascience.com/the-best-machine-learning-algorithm-for-email-classification-39888e7b1846
- https://slidetodoc.com/a-study-of-supervised-spam-detection-applied-to/
3. Proposed Strategy So Far
Goal: Classify emails by project, or by any other criterion the user wants to filter or classify on.
- Auto Classification and Auto Filters produce the same kind of result. The difference is that a classification model can learn and adapt from the user's feedback. With Auto Filters, a wrong result stays wrong: the user has to live with it, fix the generated folder manually, or wait for the code to be changed.
- There will be three kinds of folders:
- By default: Inbox, Sent, Drafts, Trash, etc.
- Made by users
- Made by Machine Learning algorithms
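To make the "learn and adapt from feedback" idea concrete, here is a minimal sketch of a classifier that can be corrected one email at a time. It assumes a multinomial Naive Bayes model with add-one smoothing, chosen only because it is simple to update incrementally; the class name `FeedbackNaiveBayes`, its methods, and the folder labels are hypothetical and not part of Cypht or Tiki.

```python
import math
import re
from collections import Counter, defaultdict


class FeedbackNaiveBayes:
    """Multinomial Naive Bayes that can be updated one email at a time."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> emails seen
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Lowercase alphanumeric tokens; a real pipeline would do more.
        return re.findall(r"[a-z0-9]+", text.lower())

    def learn(self, text, label):
        """Record one labelled email; also used for user corrections."""
        tokens = self.tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def predict(self, text):
        """Return the most probable label, or None if nothing was learned."""
        if not self.doc_counts:
            return None
        tokens = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = max(len(self.vocabulary), 1)
        best_label, best_score = None, -math.inf
        for label, docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(docs / total_docs)
            for token in tokens:
                # Add-one smoothing avoids log(0) for unseen words.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = FeedbackNaiveBayes()
model.learn("Invoice for March hosting attached", "Bookkeeping")
model.learn("Server down alert on web01", "Servers")
guess = model.predict("Weekly newsletter digest")
# When the user moves the email to another folder, that move is the feedback.
model.learn("Weekly newsletter digest", "Later")
```

With a static filter, the misfiled newsletter would be misfiled again next week. Here the user's correction becomes one more training example, so the next similar email can land in the right folder.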
4. Implementation Strategy
4.1. Tasks for Phase 1
- UI Designing
- Wireframing
- Mockups
- Find and finalize the best search tool
4.2. Tasks for Phase 2
- Text Sanitization Mechanism
- Tokenizer and Stemmers
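As a rough illustration of what the Phase 2 text pipeline could look like, the sketch below chains sanitization, tokenization, and stemming. The stop-word list and the suffix-stripping stemmer are deliberately crude placeholders (assumptions for illustration only); a production version would more likely use an established stemmer such as Porter's.

```python
import html
import re

# Tiny illustrative lists; real ones would be much larger.
STOP_WORDS = {"a", "an", "and", "are", "the", "to", "of", "for", "is", "in", "on"}
SUFFIXES = ("ing", "ed", "es", "s")


def sanitize(raw):
    """Strip HTML tags, decode entities, drop URLs, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase alphabetic tokens with stop words removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def stem(token):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(raw):
    """Full pipeline: raw email body -> list of stemmed tokens."""
    return [stem(t) for t in tokenize(sanitize(raw))]


tokens = preprocess("<p>Servers are <b>failing</b> &amp; invoices pending</p>")
```

Keeping the three steps as separate functions makes it easy to swap in a better tokenizer or stemmer later without touching the rest of the pipeline.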
4.3. Tasks for Phase 3
- Make a list of top 3 best algorithms/classifiers (that are able to learn)
- Explore and finalize best Bag of Words technique
- Data set preparation
- Get email data from Cypht
- Data preprocessing
- Train model
- Evaluate and compare the best model
- Integrate the best model in Cypht for email classification
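The bag-of-words and model comparison steps above can be sketched as follows. This assumes the simplest bag-of-words variant (raw term counts) and plain accuracy as the comparison metric; the function names and the toy candidate classifiers are hypothetical stand-ins for whatever the top-3 shortlist turns out to be.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(tokens, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokens).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


def accuracy(predict, samples):
    """Fraction of (tokens, label) samples that predict() labels correctly."""
    if not samples:
        return 0.0
    hits = sum(1 for tokens, label in samples if predict(tokens) == label)
    return hits / len(samples)


def best_model(candidates, samples):
    """Name of the candidate with the highest accuracy on held-out samples."""
    return max(candidates, key=lambda name: accuracy(candidates[name], samples))


# Toy held-out set and two toy "models" standing in for trained classifiers.
held_out = [(["invoice", "due"], "Bookkeeping"), (["server", "down"], "Servers")]
candidates = {
    "always_servers": lambda tokens: "Servers",
    "keyword": lambda tokens: "Bookkeeping" if "invoice" in tokens else "Servers",
}
winner = best_model(candidates, held_out)
```

Accuracy alone can be misleading when one folder dominates the data set, so precision and recall per folder are worth tracking as well once real Cypht data is available.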
5. PHASE 1 - Manual Filters
5.1. Expected Features
- Ability to apply the provided filters, such as:
- Unread
- Sent to me
- ?
- Ability to apply customized filters (like Outlook's rules)
- Ability to save customized filters
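One possible way to model these features is a small rule object that can be matched against messages and serialized for storage. The sketch below is an assumption about the data model, not Cypht's actual one: `Message`, `FilterRule`, and their fields are hypothetical names, and JSON is just one convenient storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Message:
    sender: str
    recipients: list
    subject: str
    unread: bool = True


@dataclass
class FilterRule:
    """A saved filter; every condition that is set must match."""
    name: str
    subject_contains: str = ""
    sender_contains: str = ""
    sent_to: str = ""
    unread_only: bool = False

    def matches(self, msg):
        if self.unread_only and not msg.unread:
            return False
        recipients = [r.lower() for r in msg.recipients]
        if self.sent_to and self.sent_to.lower() not in recipients:
            return False
        # An empty condition string is contained in any text, so it always passes.
        if self.subject_contains.lower() not in msg.subject.lower():
            return False
        return self.sender_contains.lower() in msg.sender.lower()


def apply_filter(rule, messages):
    """Return the messages that satisfy the rule."""
    return [m for m in messages if rule.matches(m)]


def save_rule(rule):
    """Serialize a rule so it can be stored and reused later."""
    return json.dumps(asdict(rule))


def load_rule(payload):
    """Rebuild a rule from its stored JSON form."""
    return FilterRule(**json.loads(payload))


inbox = [
    Message("billing@example.com", ["me@example.com"], "Invoice 42", unread=True),
    Message("news@example.com", ["list@example.com"], "Weekly digest", unread=False),
]
unread_to_me = FilterRule("Unread, sent to me", sent_to="me@example.com", unread_only=True)
matches = apply_filter(unread_to_me, inbox)
```

The built-in filters ("Unread", "Sent to me") and user-defined ones share the same structure here, which keeps the UI and the storage code uniform.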
5.2. Wireframe
5.3. Tools and Technology
Sieve and Ingo will be used to apply the email filtering. Ingo is the "Email Filter Rules Manager", which started as a frontend for the Sieve filter language. The following diagram shows the high-level architecture of the proposed solution.
Server-side filtering will be done by Sieve. The ManageSieve service lets Ingo connect to the mail server to create, edit, enable, and delete filter rules. For more details, refer to https://www.skrilnetz.net/server-side-mail-filtering-with-horde-ingo-and-sieve/
For more details on Sieve, refer to http://sieve.info/
For more details on Ingo, refer to https://github.com/horde/ingo and https://www.horde.org/apps/ingo
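To give a feel for what ends up on the server, the sketch below builds a small Sieve script from a list of rules. It only illustrates the Sieve syntax (`require`, `header :contains`, `fileinto`); it is not Ingo's actual generator, and the helper names and the example rules are hypothetical.

```python
def sieve_quote(value):
    """Quote a string for Sieve: escape backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def sieve_rule(header, needle, folder):
    """One rule: file the message into folder if header contains needle."""
    return (
        f"if header :contains {sieve_quote(header)} {sieve_quote(needle)} {{\n"
        f"    fileinto {sieve_quote(folder)};\n"
        "}\n"
    )


def sieve_script(rules):
    """Full script; the fileinto extension must be declared with require."""
    body = "\n".join(sieve_rule(*rule) for rule in rules)
    return 'require ["fileinto"];\n\n' + body


script = sieve_script([
    ("subject", "invoice", "Bookkeeping"),
    ("subject", "uptime report", "Servers"),
])
print(script)
```

A script like this would be uploaded and activated over ManageSieve, after which the mail server applies it at delivery time without the webmail client being involved.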
6. PHASE 2 - Auto Filters
6.1. Expected Features
- Filter spam email
- Filter important emails
- Integrate an external third-party spam filtering tool? (reference: https://www.quora.com/What-major-email-providers-filter-spam-well)
7. PHASE 3 - Auto Classification
7.1. Expected Features
- Filter spam email (points 1-4 are similar to Google's)
- Filter important emails
- Filter updates
- Filter promotions
- Make folders based on projects, clients
- Learn and adapt based on users' feedback and on the emails they search for most