regex – The Boschmans Account

First, a quick reminder for myself: there’s an extremely good guide to regex on Andrew M. Kuchling’s pages.

Secondly, you don’t really *need* regex to parse for hashtags in a tweet – it’s a bit of overkill. The following code will do as well, and was written in 1 minute after searching 15 minutes in regex how to make certain to include hyphens ( – ) and other non-characters if they are put into the hashtag.

The regular expression that I find works quite well for all hashtags that don’t have a hyphen in it:

>>> hashtag = "This is a #hashtag #test-link #a should#not#work"
>>> x = re.compile(r'\B#\w+')
>>> x.findall(hashtag)
['#hashtag', '#test', '#a']

So the above code correctly finds all words beginning with a hashtage, and not the ones that contain a hashtag inside the word. Note that the hyphen and the word after it is not included.

This is the short code I wrote that does all I want:

>>> hashtag = "This is a #hashtag #test-link #a should#not#work"
>>> for word in hashtag.split():
	if word[0] == "#":
		print word		
#hashtag
#test-link
#a

In section 6 of the above-mentioned guide, Andrew states that in some cases string methods (like split) are faster than using regex. For simplicity, I’m going to use the latter code.

Update: Grrr – discovered that the tweets I am processing are in html so have href tags around them – which means ofcourse that there are no blanks for me to split words in. After another unsuccessful session with regex and just to continue I’ve used the BeautifulSoup html parsing library to get around that by stripping out all tags and then splitting the sentence up again. Probably not as efficient as immediately using regex, I’ll have to revisit this in the future.