<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Boschmans Account &#187; regex</title>
	<atom:link href="http://www.boschmans.net/tag/regex/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.boschmans.net</link>
	<description>A collection of interests and happenings...</description>
	<lastBuildDate>Wed, 01 Feb 2012 22:21:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Not using regular expressions (re or regex) to find a #hashtag (python).</title>
		<link>http://www.boschmans.net/2010/01/27/not-using-regular-expressions-re-or-regex-to-find-a-hashtag-python/</link>
		<comments>http://www.boschmans.net/2010/01/27/not-using-regular-expressions-re-or-regex-to-find-a-hashtag-python/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 22:01:08 +0000</pubDate>
		<dc:creator>alex</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>

		<guid isPermaLink="false">http://www.boschmans.net/?p=924</guid>
		<description><![CDATA[First, a quick reminder for myself: there&#8217;s an extremely good guide to regex on Andrew M. Kuchling&#8217;s pages. Secondly, you don&#8217;t really *need* regex to parse for hashtags in a tweet &#8211; it&#8217;s a bit of overkill. The following code &#8230; <a href="http://www.boschmans.net/2010/01/27/not-using-regular-expressions-re-or-regex-to-find-a-hashtag-python/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>First, a quick reminder for myself: there&#8217;s an extremely good guide to regex on Andrew M. Kuchling&#8217;s <a title="Regex for python explanation" href="http://www.amk.ca/python/howto/regex/" target="_blank">pages</a>.</p>
<p>Secondly, you don&#8217;t really *need* regex to parse for hashtags in a tweet &#8211; it&#8217;s a bit of overkill. The following code will do as well, and was written in 1 minute, after I had spent 15 minutes searching for a regex that would also include hyphens ( - ) and other non-word characters when they are part of the hashtag.</p>
<p>The regular expression that I find works quite well for all hashtags that don&#8217;t have a hyphen in them:</p>
<pre class="brush: python; title: ; notranslate">
&gt;&gt;&gt; import re
&gt;&gt;&gt; hashtag = &quot;This is a #hashtag #test-link #a should#not#work&quot;
&gt;&gt;&gt; x = re.compile(r'\B#\w+')
&gt;&gt;&gt; x.findall(hashtag)
['#hashtag', '#test', '#a']
</pre>
<p>So the above code correctly finds all words beginning with a hash sign (#), and not the ones that contain a hash sign inside the word. Note that the hyphen and the word after it are not included, because \w only matches letters, digits and the underscore.</p>
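<p>For what it&#8217;s worth, one way to keep the hyphen seems to be a character class that allows it next to the word characters. The sketch below is only that: the name HASHTAG_RE is made up for illustration, and it assumes a hashtag is a hash sign followed by letters, digits, underscores or hyphens.</p>

```python
import re

# Assumption: a hashtag is "#" followed by word characters or hyphens.
# [\w-] extends \w (letters, digits, underscore) with a literal hyphen.
HASHTAG_RE = re.compile(r'\B#[\w-]+')

tweet = "This is a #hashtag #test-link #a should#not#work"
found = HASHTAG_RE.findall(tweet)
print(found)  # ['#hashtag', '#test-link', '#a']
```

<p>One caveat with this pattern: a trailing hyphen is kept as well, so &quot;#tag-&quot; would be returned with the hyphen attached.</p>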
<p>This is the short code I wrote that does all I want:</p>
<pre class="brush: python; title: ; notranslate">
&gt;&gt;&gt; hashtag = &quot;This is a #hashtag #test-link #a should#not#work&quot;
&gt;&gt;&gt; for word in hashtag.split():
...     if word.startswith(&quot;#&quot;):
...         print(word)
...
#hashtag
#test-link
#a
</pre>
<p>In section 6 of the above-mentioned guide, Andrew states that in some cases string methods (like split) are faster than using regex. For simplicity, I&#8217;m going to use the split-based code.</p>
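<p>The speed claim is easy to check on your own machine. Here is a rough sketch using the standard timeit module; the function names with_split and with_regex are hypothetical, and the timings will differ per machine and Python version, so no numbers are quoted here.</p>

```python
import re
import timeit

HASHTAG_RE = re.compile(r'\B#\w+')
tweet = "This is a #hashtag #test-link #a should#not#work"


def with_split(text):
    """Return every whitespace-separated word that starts with '#'."""
    return [word for word in text.split() if word.startswith("#")]


def with_regex(text):
    """Return every match of the compiled pattern."""
    return HASHTAG_RE.findall(text)


# Time 100,000 calls of each approach; results vary by machine.
for func in (with_split, with_regex):
    seconds = timeit.timeit(lambda: func(tweet), number=100000)
    print(func.__name__, round(seconds, 3))
```

<p>Keep in mind that the two functions do not return the same thing for this sentence: the split version keeps &quot;#test-link&quot; whole, while the regex version stops at the hyphen.</p>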
<p>Update: Grrr &#8211; I discovered that the tweets I am processing are in HTML, so the hashtags have href (link) tags around them &#8211; which of course means that there are no blanks for me to split the words on. After another unsuccessful session with regex, and just to keep going, I&#8217;ve used the <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup HTML parsing library</a> to get around that by stripping out all tags and then splitting the sentence up again. This is probably not as efficient as using regex directly; I&#8217;ll have to revisit this in the future.</p>
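<p>The strip-the-tags-then-split idea can be sketched roughly as follows. Note that this is not the BeautifulSoup code itself: to keep the example self-contained it swaps in the standard library&#8217;s html.parser module, and the class TextExtractor, the function hashtags_from_html and the example.com links are all made up for illustration.</p>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def hashtags_from_html(fragment):
    """Strip all tags, then return the words that start with '#'."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    # Assumption: joining with a space, so text from adjacent tags
    # is not glued together into one word.
    text = " ".join(parser.chunks)
    return [word for word in text.split() if word.startswith("#")]


# Hypothetical tweet markup; real tweet HTML may look different.
tweet_html = (
    'Reading about <a href="http://example.com/search?q=%23python">#python</a>'
    ' and <a href="http://example.com/search?q=%23test-link">#test-link</a> today'
)
print(hashtags_from_html(tweet_html))  # ['#python', '#test-link']
```

<p>Joining the text chunks with a space is a trade-off: it guarantees a blank around each link, but a tag placed in the middle of a word would split that word in two.</p>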
]]></content:encoded>
			<wfw:commentRss>http://www.boschmans.net/2010/01/27/not-using-regular-expressions-re-or-regex-to-find-a-hashtag-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

