The auto-linking code basically rewrote the whole string escaping non-ascii
characters in an inefficient way, and building a full character offset map
between the unescaped and escaped texts before sending the contents to
TwitterText's extractor.
Instead of doing that, this commit changes the TwitterText regexps to include
valid IRI characters in addition to valid URI characters.
The auto-linking code basically rewrote the whole string escaping non-ascii
characters in an inefficient way, and building a full character offset map
between the unescaped and escaped texts before sending the contents to
TwitterText's extractor.
Instead of doing that, this commit changes the TwitterText regexps to include
valid IRI characters in addition to valid URI characters.
* Update twitter-text from 1.14 to 3.1.0
* Disable emoji parsing
* Properly depend on twitter-text for url detection
* Fix some URLs being wrongly detected client-side
* Add test for server-side validation of non-autolinkable URLs
* Fix server-side status length counting
* Added .deepsource.toml
* Removed bad use of `alias`
* Fixed operand order in the binary expression
* Prefixed unused method arguments with an underscore
* Replaced the old OpenSSL algorithmic constants with the newer strings initializers.
* Removed unnecessary UTF-8 encoding comment
* Fix wrong grouping in Twitter valid_url regex
* Add support for xmpp URIs
Fixes#9776
The difficult part is autolinking, because Twitter-text's extractor does
some pretty ad-hoc stuff to find things that “look like” URLs, and XMPP
URIs do not really match the assumptions of that lib, so it doesn't sound
wise to try to shoehorn it into the existing regex.
This is why I used a specific regex (very close, although slightly more
permissive than the RFC), and a specific scan function (a simplified version
of the generalized one from Twitter).
* Remove leading “xmpp:” from auto-linked text
In cases where a URL has a trailing hyphen the FetchLinkCardService incorrectly removes the hyphen when it is parsed
The hyphen is not a reserved character in the URI spec https://tools.ietf.org/html/rfc3986#section-2.2