Regex/code for removing "FWD", "RE", etc, from email subject
The following regex matches all of the cases the way I would expect. You may not agree on every case, because not all of them were explicitly specified. It can almost certainly be simplified, but it works:
/^((\[(re|fw(d)?)\s*\]|[\[]?(re|fw(d)?))\s*[\:\;]\s*([\]]\s?)*|\(fw(d)?\)\s*)*([^\[\]]*)[\]]*/i
The last capture group of the match holds the stripped subject.
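To illustrate reading that last group, here is a minimal Python sketch. The pattern is copied from above with the / delimiters dropped and the i flag expressed as re.IGNORECASE; the names SUBJECT_RE and strip_subject are my own and not part of the original answer.

```python
import re

# Same pattern as above; the trailing /i becomes re.IGNORECASE.
SUBJECT_RE = re.compile(
    r'^((\[(re|fw(d)?)\s*\]|[\[]?(re|fw(d)?))\s*[\:\;]\s*([\]]\s?)*|\(fw(d)?\)\s*)*([^\[\]]*)[\]]*',
    re.IGNORECASE,
)

def strip_subject(subject):
    """Return the contents of the last capture group (hypothetical helper)."""
    m = SUBJECT_RE.match(subject)  # every part is optional, so this always matches
    return m.group(SUBJECT_RE.groups).strip()

for s in ['Re: Fwd: Hello', '[Fwd: Re: Hello]', '(fwd) Hello', 'Hello']:
    print(strip_subject(s))
```

Using SUBJECT_RE.groups instead of a hard-coded index keeps the helper correct if groups are later added or removed in front of the last one.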
The subject prefix varies by country and language; see Wikipedia: List of email subject abbreviations.
For example, Brazil: RES is equivalent to RE; German: AW is equivalent to RE.
Example in Python:
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
import re
# Raw string, so the regex backslashes reach the re module unchanged.
p = re.compile(r'([\[\(] *)?(RE?S?|FYI|RIF|I|FS|VB|RV|ENC|ODP|PD|YNT|ILT|SV|VS|VL|AW|WG|ΑΠ|ΣΧΕΤ|ΠΡΘ|תגובה|הועבר|主题|转发|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$', re.IGNORECASE)
print(p.sub('', 'RE: Tagon8 Inc.').strip())
Example in PHP:
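<?php
$subject = "主题: Tagon8 - test php";
// Single-quoted pattern so PHP never tries to interpolate the $; the u modifier treats pattern and subject as UTF-8.
$subject = preg_replace('/([\[\(] *)?(RE?S?|FYI|RIF|I|FS|VB|RV|ENC|ODP|PD|YNT|ILT|SV|VS|VL|AW|WG|ΑΠ|ΣΧΕΤ|ΠΡΘ|תגובה|הועבר|主题|转发|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$/imu', '', $subject);
var_dump(trim($subject));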
Terminal:
$ python test.py
Tagon8 Inc.
$ php test.php
string(17) "Tagon8 - test php"
Note: this is mathematical.coffee's regular expression, extended with prefixes from other languages: Chinese, Danish, Norwegian, Finnish, French, German, Greek, Hebrew, Italian, Icelandic, Swedish, Portuguese, Polish, Turkish.
I used strip()/trim() to remove the leading and trailing spaces left behind.
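If you only need some of these languages, the alternation can be generated from a list instead of being typed by hand. The following is a rough sketch of that idea, not the method used above: PREFIXES is an illustrative subset and build_pattern is a hypothetical helper name.

```python
import re

# Illustrative subset only (assumption); add the prefixes your mail actually uses.
PREFIXES = ['RE', 'RES', 'FW', 'FWD', 'AW', 'WG', 'SV', 'VS']

def build_pattern(prefixes):
    """Wrap the given prefixes in the same bracket/punctuation rules as above."""
    alternation = '|'.join(re.escape(p) for p in prefixes)
    return re.compile(
        r'([\[\(] *)?(' + alternation + r') *([-:;)\]][ :;\])-]*|$)|\]+ *$',
        re.IGNORECASE,
    )

pattern = build_pattern(PREFIXES)
print(pattern.sub('', 'AW: WG: Tagon8 Inc.').strip())
print(pattern.sub('', 'RES: Fwd: Tagon8 Inc.').strip())
```

re.escape is not needed for plain letters, but it keeps the helper safe if a prefix ever contains a regex metacharacter.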
Try this one (replace with ''):
/([\[\(] *)?(RE|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$/igm
(If you put each subject through as its own string then you don't need the m modifier; it is only there so that $ matches end of line, not just end of string, for multiline string inputs.)
Explanation of regex:
([\[\(] *)?             # starting [ or (, followed by optional spaces
(RE|FWD?) *             # RE or FW or FWD, followed by optional spaces
([-:;)\]][ :;\])-]*|$)  # only count it as a Re or Fwd if it is followed by
                        # : or - or ; or ] or ) or end of line
                        # (after that you can have more of these symbols,
                        # with spaces in between)
|                       # OR
\]+ *$                  # match any trailing ] at end of line
                        # (we assume the round brackets () occur around a
                        # single Re/Fwd, but the square brackets [] may occur
                        # around the whole subject line)
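The commented layout above maps fairly directly onto Python's re.VERBOSE mode. This is a sketch under one assumption worth stating: in verbose mode, whitespace outside a character class is ignored, so each literal " *" is written as "[ ]*" here. PREFIX_RE is my own name.

```python
import re

PREFIX_RE = re.compile(r"""
    ([\[\(][ ]*)?            # starting [ or (, followed by optional spaces
    (RE|FWD?)[ ]*            # RE or FW or FWD, followed by optional spaces
    ([-:;)\]][ :;\])-]*|$)   # must be followed by : - ; ] ) or end of line
    |                        # OR
    \]+[ ]*$                 # any trailing ] at end of line
    """, re.IGNORECASE | re.VERBOSE)

# re.sub replaces every match, so repeated prefixes are all removed.
print(PREFIX_RE.sub('', '[FWD] (fw) RE: Tagon8 Inc.]').strip())
```

The space inside the character class [ :;\])-] is kept as-is, because verbose mode does not strip whitespace within a class.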
Flags:
i: case insensitive.
g: global match (match all the Re/Fwd you can find).
m: let the $ in the regex match end of line for a multiline input, not just end of string. This is only relevant if you feed all your input subjects to the regex at once; if you feed in one subject at a time you can remove it, because end of line is then end of string.
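For anyone using this from Python, here is a small sketch of how the three flags translate. As far as I know there is no g flag in Python: re.sub already replaces every match unless you pass count. PATTERN and the sample subjects are made up for illustration.

```python
import re

PATTERN = r'([\[\(] *)?(RE|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$'

# One subject per call: only i is needed.
single = re.sub(PATTERN, '', 'Fwd: Re: Tagon8 Inc.', flags=re.IGNORECASE).strip()
print(single)

# Several subjects in one string: add m so that $ also matches at each line end.
batch = 'RE: first subject]\nFW: second subject]'
with_m = re.sub(PATTERN, '', batch, flags=re.IGNORECASE | re.MULTILINE)
without_m = re.sub(PATTERN, '', batch, flags=re.IGNORECASE)
print(with_m.splitlines())     # trailing ] removed on both lines
print(without_m.splitlines())  # trailing ] survives on the first line
```

Without re.MULTILINE the prefixes are still stripped on every line, since that part of the pattern is not anchored; only the trailing-bracket cleanup depends on where $ can match.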