Title |
English Sentence Matching
|
Expression |
\b(((["'/,&%\:\(\)\$\+\-\*\w\000-\032])|(-*\d+\.\d+[%]*))+[\s]+)+\b[\w"',%\(\)]+[.!?](['"\s]|$) |
Description |
Focused on scraping English sentences from HTML/Java (without having to parse).
Correctly matches the vast majority of English sentences. There are undoubtedly cases that do not match, but I felt they were obscure enough to omit.
(Surely, the fellow that commented on this script had some sentences not match, but the example he describes does correctly match, and I provide it as the fourth example.)
Cheers |
Matches |
This is an example. | "Matching sentence." | A 9.7% increase over the last 10+ years. | The vehicle has a 5.2 liter, four-wheel drive engine. |
Non-Matches |
Class.Function |
Author |
Scotty
|
Source |
Myself |
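For anyone who wants to check the listed Matches and Non-Matches locally, here is a minimal Python sketch. It assumes Python's `re` module treats the pattern the same way the original engine did (in particular, that `\000-\032` is read as an octal range); the names `SENTENCE` and `contains_sentence` are made up for this illustration.

```python
import re

# Pattern copied verbatim from the Expression field.
# Assumption: Python's `re` reads \000-\032 as the octal range NUL..SUB.
SENTENCE = re.compile(
    r"""\b(((["'/,&%\:\(\)\$\+\-\*\w\000-\032])|(-*\d+\.\d+[%]*))+[\s]+)+\b[\w"',%\(\)]+[.!?](['"\s]|$)"""
)

# Samples taken from the Matches / Non-Matches fields of the entry.
MATCHES = [
    'This is an example.',
    '"Matching sentence."',
    'A 9.7% increase over the last 10+ years.',
    'The vehicle has a 5.2 liter, four-wheel drive engine.',
]
NON_MATCHES = ['Class.Function']


def contains_sentence(text):
    """Return True if the pattern finds a sentence anywhere in text."""
    return SENTENCE.search(text) is not None


if __name__ == "__main__":
    for sample in MATCHES + NON_MATCHES:
        print(contains_sentence(sample), repr(sample))
```

Note that the sketch uses `search` rather than a full-string match: because the pattern starts with `\b`, the match for the quoted sample begins at the first letter, after the opening quotation mark.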
Title: PHP?
Name: Noni
Date: 9/8/2010 10:49:09 AM
Comment:
Any idea on how to get this to work in PHP with preg_match?
Title: Edited to include sentences with square brackets in them
Name: cam8001
Date: 4/28/2008 12:35:40 AM
Comment:
This choked on a sentence with square brackets in it, so I've edited it just a little to fix that.
\b(((["'/,&%\[\]\:\(\)\$\+\-\*\w\000-\032])|(-*\d+\.\d+[%]*))+[\s]+)+\b[\w"',%\(\)]+[.!?](['"\s]|$)
Title: sentence splitter
Name: Rudolf Stammis (Alkmaar, The Netherlands)
Date: 9/23/2005 2:15:41 PM
Comment:
Nice and useful, but when a sentence contains a number with a dot (.) in the middle, this regex sometimes breaks the sentence at that dot. A workaround could be to glue the parts together again (that is, when a fragment starts with a numeral and the previous one ends in a numeral).
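The gluing workaround described above could be sketched in Python roughly as follows. This is only one reading of the idea: the function name `glue_numeric_splits` is hypothetical, and it assumes the fragments arrive as a list of already-stripped strings in document order.

```python
import re


def glue_numeric_splits(fragments):
    """Rejoin fragments that were split at the dot inside a decimal number.

    A fragment is appended to the previous one when the previous fragment
    ends with "<digit>." and the current fragment starts with a digit.
    Assumption: fragments carry no leading or trailing whitespace.
    """
    glued = []
    for fragment in fragments:
        if (
            glued
            and re.search(r"\d\.$", glued[-1])  # previous ends in a numeral plus dot
            and re.match(r"\d", fragment)       # current starts with a numeral
        ):
            glued[-1] += fragment  # no space: the dot was a decimal point
        else:
            glued.append(fragment)
    return glued


if __name__ == "__main__":
    pieces = ["It rose by 9.", "7% last year.", "That was a surprise."]
    print(glue_numeric_splits(pieces))
```

One caveat with this heuristic: it will also merge two genuine sentences when the first happens to end in a number and the second starts with one, so it may need an extra check depending on the text.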