400th blog post !

February 6th, 2010

I just noticed that this is my 400th blog post – I started blogging using Wordpress on January 17, 2005 – more than 5 years ago ! Somewhere on the site there’s a link to an earlier blog of mine using Blosxom, but I quickly switched over to Wordpress.

Wordpress has been – for me – the ideal Content Management System (CMS).

400 posts means about 6 blog posts per month on average, something which surprised me – I actually thought it was less.

My first post was about the birthday of Tom, so it’s fitting to put up a picture of the family as it is now.

A picture of the Boschmans Family end of 2009.

Onwards to another 400 blog posts or another 5 years, whichever comes first !

Simple Python threading.Thread example using Queue

February 3rd, 2010

I managed to write a really simple example of using threads in Python, which I hope will give me more insight into how to adapt my other programs. I’m also posting it so I can re-use it later: if I need to revisit threading, I won’t have to scour the internet again to assemble the bits and pieces.

The example below uses 3 threads, and processes 10 pairs of numbers (tuples) that I put in a list.

    # Our list of work todo
    inputlist_ori = [ (5,5),(10,4),(78,5),(87,2),(65,4),(10,10),(65,2),(88,95),(44,55),(33,3) ]

Those numbers are divided over those 3 threads by the Queue system.

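The Queue system itself is limited to 5 slots, although this could easily be changed to more or less. You will notice in the console print that the message “Waiting for threads to finish.” typically appears around the fifth result. With 10 jobs and only 5 slots, the main program can only finish filling the queue once the threads have taken the first five jobs, so by then the queue is clearly being used and the main program has continued on.

After putting everything in the queue system, the program calls .join() on the queue. Note that this waits until every job in the queue has been marked as done with task_done(); it does not wait for the threads themselves to exit.

All spawned threads stay active, looping and accepting jobs, until the queue stays empty: each thread waits up to 1 second for a new job, and shuts down when none arrives.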
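To make the difference between waiting for the queue and waiting for the threads a bit more concrete, here is a minimal sketch. It is written for Python 3 (where the module is called queue rather than Queue), so it is not the code from this post. The names worker, results and results_lock, and the use of None as a "stop" marker, are my own assumptions for illustration.

```python
import queue
import threading

jobs = queue.Queue(maxsize=5)      # same idea as the 5-slot queue in the post
results = []                       # hypothetical place to collect the products
results_lock = threading.Lock()    # protects the shared results list

def worker():
    while True:
        job = jobs.get()           # blocks until a job is available
        if job is None:            # assumed "stop" marker, one per thread
            jobs.task_done()
            break
        a, b = job
        with results_lock:
            results.append(a * b)
        jobs.task_done()           # tell the queue this job is finished

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for pair in [(5, 5), (10, 4), (78, 5)]:
    jobs.put(pair)
for _ in threads:
    jobs.put(None)                 # ask every worker to stop

jobs.join()                        # returns once every item has been task_done()
for t in threads:
    t.join()                       # returns once each thread has really exited
```

With a stop marker the workers do not need a timeout to find out that the work is over, at the cost of having to put one marker per thread in the queue.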

I based most of my simple example on the examples in the Python threading tutorial (.pdf) by Norman Matloff and Francis Hsu that I referenced in a previous blog post. However, while their examples undoubtedly do more and are more extensive, they are also more complex. This example is deliberately made as simple as possible, to make the basic principles of threading and the queue system easy to understand.

Things I stumbled over:

  • Duh! You spawn the threads before you fill up the queues with stuff todo…
  • When printing out things to the console or python shell, things got jumbled because different threads took over from each other – to solve that I call acquire() and release() on one shared threading.Lock() object to make sure that a thread can finish printing. (It has to be the same lock object everywhere: calling threading.Lock() again creates a new, unrelated lock.) Not sure if I completely understand all the possibilities this offers.
  • Still a bit stumped on getting more info (name, etc.) on the thread that is running at the moment – haven’t figured out yet how to do that.
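On the last two points, the standard library does seem to cover both: threading.current_thread() returns the Thread object for whichever thread calls it, and a lock can be used in a with block so it is always released. A small sketch in Python 3 syntax follows. The names report, print_lock and seen_names, and the "workerbee-N" thread names, are made up for this illustration.

```python
import threading

print_lock = threading.Lock()   # one shared lock for all threads
seen_names = []                 # hypothetical list, only used to check the result

def report():
    # current_thread() gives the Thread object this code is running in
    name = threading.current_thread().name
    with print_lock:            # acquired here, released automatically at the end
        seen_names.append(name)
        print("Hello from", name)

threads = [threading.Thread(target=report, name="workerbee-{0}".format(i))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The order of the printed lines can differ from run to run, but each line comes out whole because only one thread holds the lock at a time.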

Feel free to comment and ask questions – if you can improve this program, please let me know !

    # threading test
    # Alex Boschmans
    # www.boschmans.net
    # January 2010

    #
    # IMPORT SECTION
    #
    import threading, Queue

    #
    # Variables setup
    #
    THREAD_LIMIT = 3                # This is how many threads we want
    jobs = Queue.Queue(5)           # This sets up the queue object to use 5 slots
    singlelock = threading.Lock()   # This is a lock so threads don't print through each other (and other reasons)

    # Our list of work todo
    inputlist_ori = [ (5,5),(10,4),(78,5),(87,2),(65,4),(10,10),(65,2),(88,95),(44,55),(33,3) ]

    #
    # This is called from the main function
    # It spawns the threads, fills up the queue with work items that the threads will use
    # And then waits for all the jobs to be finished
    # This could use some more try:except code...
    #
    def draadje(inputlist):
        print "Inputlist received..."
        print inputlist

        # Spawn the threads
        print "Spawning the {0} threads.".format(THREAD_LIMIT)
        for x in xrange(THREAD_LIMIT):
            print "Thread {0} started.".format(x)
            # This is the thread class that we instantiate.
            workerbee().start()

        # Put stuff in queue
        print "Putting stuff in queue"
        for i in inputlist:
            # Block if queue is full, and wait 5 seconds. After 5s raise Queue Full error.
            try:
                jobs.put(i, block=True, timeout=5)
            except Queue.Full:
                singlelock.acquire()
                print "The queue is full !"
                singlelock.release()

        # Wait for the jobs to finish
        singlelock.acquire()        # Acquire the lock so we can print
        print "Waiting for threads to finish."
        singlelock.release()        # Release the lock
        jobs.join()                 # This waits until every queued job has been marked done.

    #
    # Main thread class - based on threading.Thread
    # This class is cloned/used as a thread template to spawn those threads.
    # The class has a run function that gets a job out of the jobs queue
    # And lets the queue object know when it has finished.
    #
    class workerbee(threading.Thread):
        def run(self):
            # run forever
            while 1:
                # Try and get a job out of the queue, waiting at most 1 second
                try:
                    job = jobs.get(True,1)
                except Queue.Empty:
                    break           # No more jobs in the queue
                singlelock.acquire()        # Acquire the lock
                print "Multiplication of {0} with {1} gives {2}".format(job[0],job[1],(job[0]*job[1]))
                singlelock.release()        # Release the lock
                # Let the queue know the job is finished.
                jobs.task_done()

    #
    # Executes if the program is started normally, not if imported
    #
    if __name__ == '__main__':
        # Call the mainfunction that sets up threading.
        draadje(inputlist_ori)

Sigh. I just finished adding spaces to show where a def ends, and the damn code highlighter removed it again. Grrrrrr.

Not using regular expressions (re or regex) to find a #hashtag (python).

January 27th, 2010

First, a quick reminder for myself: there’s an extremely good guide to regex on Andrew M. Kuchling’s pages.

Secondly, you don’t really *need* regex to parse for hashtags in a tweet – it’s a bit of overkill. The code further down will do as well, and was written in 1 minute, after I had spent 15 minutes searching for a regex that makes certain to include hyphens ( - ) and other non-word characters if they are put into the hashtag.

The regular expression that I find works quite well for all hashtags that don’t have a hyphen in them:

    >>> import re
    >>> hashtag = "This is a #hashtag #test-link #a should#not#work"
    >>> x = re.compile(r'\B#\w+')
    >>> x.findall(hashtag)
    ['#hashtag', '#test', '#a']

So the above code correctly finds all words beginning with a hashtag, and not the ones that contain a hashtag inside the word. Note that the hyphen and the word after it are not included.

This is the short code I wrote that does all I want:

    >>> hashtag = "This is a #hashtag #test-link #a should#not#work"
    >>> for word in hashtag.split():
    ...     if word[0] == "#":
    ...         print word
    ...
    #hashtag
    #test-link
    #a

In section 6 of the above-mentioned guide, Andrew states that in some cases string methods (like split) are faster than using regex. For simplicity, I’m going to use the latter code.
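For completeness: the regex can apparently be made to keep the hyphen too, by allowing it inside a character class next to \w. This is only a sketch, written in Python 3 syntax rather than the Python 2 used above; the pattern and the names with_hyphens and by_split are my own illustration, not something taken from the guide.

```python
import re

hashtag = "This is a #hashtag #test-link #a should#not#work"

# [\w-]+ accepts word characters and hyphens after the '#'.
# \B still rejects a '#' that sits in the middle of a word.
with_hyphens = re.findall(r'\B#[\w-]+', hashtag)

# The split-based approach from the post, as a list comprehension.
by_split = [word for word in hashtag.split() if word.startswith("#")]

print(with_hyphens)
print(by_split)
```

On this sample both approaches give the same three hashtags. They will differ on other input: for example the split version keeps trailing punctuation such as a comma, while the regex version stops at it.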

Update: Grrr – discovered that the tweets I am processing are in html, so they have href tags around them – which means of course that there are no blanks for me to split the words on. After another unsuccessful session with regex, and just to be able to continue, I’ve used the BeautifulSoup html parsing library to get around that by stripping out all tags and then splitting the sentence up again. Probably not as efficient as using regex directly; I’ll have to revisit this in the future.
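The "strip the tags, then split" idea can be sketched without any extra library. Note that this is not the BeautifulSoup code I actually used: it swaps in the HTMLParser class from the Python 3 standard library, and the class name TextOnly, the function hashtags_from_html and the sample tweet are all made up for this illustration.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def hashtags_from_html(html_text):
    parser = TextOnly()
    parser.feed(html_text)
    parser.close()
    # Assumption: joining with a space treats every tag as a word boundary,
    # so a hashtag wrapped directly in a link still becomes its own word.
    text = " ".join(parser.chunks)
    return [word for word in text.split() if word.startswith("#")]

# Hypothetical tweet markup, only for this example
tweet = ('Reading <a href="http://example.com/a">#python</a> and '
         '<a href="http://example.com/b">#test-link</a> today')
found = hashtags_from_html(tweet)
print(found)
```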

Using threads in Python

January 26th, 2010

I’ve been trying to set up threading in Python, so that in the back-end of the service system that I’m developing I can query more than one source at the same time. So instead of querying one server and waiting for feedback, I can launch 10 threads and thus query 10 servers and process each server’s feedback via its own thread.

So a very vague, generalising definition of a thread is an independent ‘process’ that performs a job that you give it. You can control how many threads that you launch. Each thread is a copy of the original thread that you describe (in essence a python def function that has been wrapped in a thread class).

Right now, my understanding of threads is a bit confused. So far it seems that there are several different ways of using them:

  • using a number of threads that you launch, use, and forget about them (they go away)
  • improving on that by putting those threads in a thread pool, and when a thread finishes, re-using it for the next job (so you have  5 threads but 10 jobs to do, those five threads take five jobs, and the first thread that finishes takes on the sixth job, the second thread to finish the seventh job, and so on)
  • the final step seems to be (I haven’t got that far in my implementation) to set up worker bees that are managed by one thread (a better description is promised, as soon as I have understood it!)
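The second option (a fixed pool of threads that picks up the next job as soon as a thread is free) can be sketched with the ThreadPoolExecutor class that ships with Python 3. That class is not what I’m using in my own code; it just shows the idea in a few lines. The function query_server and its return value are hypothetical stand-ins for a real request to a server.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def query_server(server_id):
    # Hypothetical stand-in for a slow network call: it just reports
    # which pool thread handled which job.
    return (server_id, threading.current_thread().name)

# 5 threads, 10 jobs: each thread takes a new job when it finishes one
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(query_server, range(10)))

worker_names = set(name for _, name in results)
for server_id, name in results:
    print("server", server_id, "was handled by", name)
```

All 10 jobs get done, by at most 5 different threads, and map() hands the results back in the order of the input.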

Since I’ve been scouring the net for information over threads, here is a list of resources that discuss, give examples, and explain threads – it’s useful for me to refer to, it might be useful for you as well :

  • DaveN has an extensive post, with examples, building up gradually. It’s only at the end that you read that the code shown has never been run, which is a bit of a letdown. Still worth a good read though !!
  • A very thorough 25-page pdf document that starts from the beginning is available on the site of UC Davis, University of California. It goes into all the nitty gritty details.
  • An example that uses workers in threads is found on the blog of Danial Taherzadeh.
  • Another one that discusses using multiple queues chained together can be found on IBM’s developerworks site.
  • And the blog post from Halotis that started my looking into threads…

Right now I’m using threads in a thread pool, but I’m not doing something right – I noticed that while I have 10 jobs to do, only the first five get done, and the others ‘disappear’.

I guess the only way to get it working is to continue reading the information above until it makes sense. Sometimes I wonder if I’m not slightly masochistic, looking for challenges like that… ai me poor pounding head ! :-)

Parrot AR.Drone – instant lust.

January 13th, 2010

Wow. I just saw the advert for a new type of remote controlled helicopter called the Parrot AR.Drone, and I experienced instant want-it syndrome. It’s controlled via your iPhone. It’s not out yet (it is due somewhere in 2010), but from what you see in the demo it is amazing !

It is a helicopter on steroids, typically geek stuff, and it looks damn cool.

Excel 2008 for Mac insists on using mm/dd/yyyy instead of dd/mm/yyyy !

January 1st, 2010

I’ve just discovered a really bothersome bug using Excel 2008 for Mac (I’m using the current latest version 12.2.3 of Excel 2008 for Mac). My date columns keep on insisting on using the US format of month-day-year, even when the mac time setting is set to Brussels, Belgium. There does not seem to be a way to fix the setting in Excel, and similar date format problems have been appearing on forums since Excel 2004.

The workaround I found was to select the date format 01-Mar-2009, in other words show the month as a word instead of as a number. This way I’m sure that the date I’m typing in is correct. But really this is annoying me immensely.

Another workaround (haven’t tried it yet) so that Excel uses the underlying Mac date format is to NOT preselect CELLs and format them as DATES – just type the date in a normal cell using slashes and Excel will then convert the values to the correct international date as set by the mac… I’ll try that next time.

WHY hasn’t this been resolved yet !? What’s so hard about doing this the right way ? I know the answer of course – it’s a big company and everyone there has to jump through 15 hoops simultaneously while hopping on one leg and signing delivery forms with a pen that only works 1 in 3 times :-)

My HD Recorder KISS DP-558 is dead. So is the Kiss website.

December 27th, 2009

Kiss Technology is dead

After 4 years of usage, my Kiss DP-558 recorder is dead. Done. No longer working. And the Kiss website is down as well. Cisco/Linksys has taken over and shut down Kiss Technologies. More’s the pity that they never *did* anything with what they bought…

A mains current failure that came and went made sure that it died an agonizing death, always switching on and then switching off again. I’m sorry to see you go mate, you earned the ‘wife-acceptance’ trophy, which is hard to come by.

So I’m looking for a replacement. I’ve finally settled on an Emtec S800 (I originally bought the S800H, but this only has a digital tuner, not an analog one, and we’re still (stubbornly) analog).

Still, no more Electronic Program Guide to pick and choose from. And still no H.264 codec support (which the S800H did have). And no more having something decent looking under your television that fitted in with the Hi-Fi equipment.

Bummer.

What about flex on this blogpost ?

December 19th, 2009

For those few regular readers out there, they have probably noticed that I no longer post regularly about Adobe Flex.

Please be assured that this is not out of the picture ! Rather, I wanted to learn Flex enough to get by in it. It’s been *very* interesting, but also very hard sometimes to wrap my head around Actionscript and MXML. Now that I know a bit about what I can do with Flex, I’ve started again with Python and more specifically with CherryPy.

CherryPy is a very easy-to-use web framework that you can use to set up your own webserver in a flash. It provides a basic syntax for setting up the webservice, then scurries out of the way, letting you ‘get on with it’, whatever that may be.

Currently I’m setting up a local Webserver (using CherryPy) and this is where most of my time has gone to.

Once the python application on there has been created (and most of it has) I will then head back to Flex and its uses as a reporting tool – I’ll be trying to use PyAMF as the glue between python functions and Flex datagrids.

Anyways, more on that later…

Feedparser.py and its uses…

December 19th, 2009

I recently discovered feedparser.py, a library written by Mark Pilgrim that is amazing if you want to use python to consume rss feeds. It ‘normalizes’ the different versions of rss/atom out there into one structure that you can use consistently. Doesn’t matter if it’s atom 0.1 or 0.3.

A few links that are interesting together with feedparser.py as they show its usage:

I’m constantly amazed at the quality of the python code that is out there and that you can find via a simple google query. It certainly makes me think that choosing Python over, say, Perl was a good decision.

As for using feedparser.py to put relevant tweets on your website, note that you can also use javascript to achieve the same thing; go here for some twitter.com goodies and an explanation on how to set this up.

Cleaning up user input variables on the web (Python)

December 5th, 2009

Only recently I’ve discovered the power of ‘re’, the python regular expression library. Instead of writing long functions that process text character by character to add or remove stuff, you use re, write an expression in regex that achieves what you want and basta! in a few lines things get done.

For example the following function will remove any html tags (preventing Cross Site Scripting) and escape the rest of whatever the user types in:

    import cgi
    import re

    # Remove html tags and escape the input
    def scrapeclean(text):
        # This matches any opening or closing tag (everything from a < up to the next >)
        x = re.compile(r'<[^<]*?/?>')
        # Replace to nothing using sub and escape what's leftover and return the result all in one line!
        return cgi.escape(x.sub('', text))

For full disclosure : I took the compile statement from the following site (I’m not a regex expert).

So you can call this function from somewhere in your python code and the result will be ‘scraped clean’ of all tags beginning with < and ending with >, plus any ampersands or other special characters get to be ‘escaped’.

YMMV – this is very likely not a complete protection against all the things a hacker can input in your website, but it’s certainly a start.
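For anyone on Python 3: cgi.escape no longer exists in recent versions, and html.escape from the standard library is the usual replacement. Below is a rough Python 3 sketch of the same function. The name TAG_RE and the sample input are only there for illustration, and the same warning applies: this is a start, not complete protection.

```python
import html
import re

# Same pattern as in the post: any opening or closing tag, from < to the next >
TAG_RE = re.compile(r'<[^<]*?/?>')

def scrapeclean(text):
    # Strip the tags first, then escape &, <, > and quotes in what is left
    return html.escape(TAG_RE.sub('', text))

cleaned = scrapeclean('<b>Tom & Jerry</b> say "hi"')
print(cleaned)
```

Unlike cgi.escape, html.escape also escapes quote characters by default, which is why the double quotes in the sample come out as &quot;.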
