PDB style function backtrace in python using Inspect module.

A while back I had a conversation with a friend of mine, which went something like this(with some paraphrasing).

Friend : Hey, is there a way in python to display the function calls which end up calling a certain function, kind of like a function call flow but displaying only the ones which call a certain function.
Me : Have you tried a python debugger like PDB? Check out the backtrace command.
Friend : Oh, I Don’t have any idea about PDB, and there are only few such functions, I don’t need any other debugger functionality, Debugger may be bit of a overkill here.
Me : Oh Ok, Let me see what I can do!!.

I found it to be the perfect excuse to play with Inspect Module, which provides several useful functions to help get information about live objects in a running program.
About 5 minutes later, I had the working code. (Isn’t python Freakin Awesome!!)

import inspect
def getBackTrace():
    for idx, frame in enumerate(inspect.stack()[1:-1]):
        print '#%d  %s at %s'%(idx, frame[3], frame[1])

Suffice it to say, that his problem was solved. Now he can just invoke getBackTrace() to get the list of functions in outer frames from within a function. 🙂

On my way back home that day, I thought that this could be a good utility for someone who loves to read other people code, but considers a full debugging session a overkill to get a better understanding of the unfamiliar code.You can also think of it as a addition to the old school Printf Debugging.FWIW, If it’s well written code in python, most of the times you will just get what the code is trying to do. Python code is really just executable pseudo-code.

My friend suggested that I should add this code to my collection of Gists@github. I asked him if he can think of some other features I should be adding to this code. After some talking, we narrowed down to the following two nice-to-have features.

  • Add some more context, by printing function arguments, source code line number for each of the frames.
  • The ability to log this information to a external log file, defaulting to the standard outstream.

It took me some more time to do the code/testing, but it was worth the effort. Overall what’s really great about this utility function is that tracing a function call is as easy as putting @getBackTrace() decorator before the function definition and just run the code. The other ( rather tedious) way is to start a pdb debugger session, setting the function breakpoint, running the code and then inspecting the stack once breakpoint is reached. So choose the way you like ;). Here is the code. (Tip: Looks a lot better @Github)

import sys, inspect
def getBackTrace(printArgs=None , logger=sys.stdout.write):
    '''
    This decorator method can be used to get GDB style backtrace (Stack trace) of 
    currently active stack frames during the execution of a program, 
    optionally printing positional/variable/keyword arguments for the 
    functions in the outer frames.
    It can be used to easily trace the function call flow in a running application, 
    without having to use a external debugger like pdb.
    
    @Input
      printArgs - This is a boolean value which is true/false depending on 
                  whether function arguments need to be displayed.
      logger    - This can be a python logger instance to log information to a log file.
                  By default information is printed on stdout.
    @Output
      Output is similar to GNU Debugger Stack traces, the difference being 
      in the way arguments are displayed.
      More info on GDB backtrace command here - 
      http://inside.mines.edu/fs_home/lwiencke/elab/gdb/gdb_42.html

    Note: AFAIK, Once a function is called with the respective arguments, 
    these arguments become part of locals() function scope in python, and they can be 
    mutated by function code. So, the argument values may be **DIFFERENT**
    than what the function initially may have got and I am only displaying 
    values the variables had at the time stack trace was done.
    
    '''
    log = logger or logger.debug
    log('*'*50 + ' BackTrace START ' + '*'*50+'\n')
    def wrap(f):
        log("#0   %s at %s\n"%(f.__name__, f.__module__))
        def wrapped_f(*args, **kwargs):
            outerFrames = inspect.stack()[1:-1]
            for frame_idx, frame in enumerate(outerFrames):
                functionName = frame[3]
                moduleName = inspect.getmodulename(frame[1]) or frame[1]
                traceStr = "#%d   %s at %s:%s\n"%(frame_idx+1, functionName, 
                                                  moduleName, frame[2])
                if printArgs:
                    argObj = inspect.getargvalues(frame[0])
                    posArgs = sorted([ [arg, argObj.locals.get(arg)] for \
                                       arg in argObj.args or []])
                    varArgs = sorted(argObj.locals.get(argObj.varargs,{}))
                    keywordArgs = sorted(argObj.locals.get(argObj.keywords,{}).items())

                    prettyArgsString = ', '.join('%s=%s'%(str(item[0]), 
                                                          str(item[1])) for item in posArgs)
                    prettyVarArgsString = ', '.join('%s'%(str(item)) for item in varArgs)
                    prettyKeywordArgsString = ', '.join('%s=%s'%(str(item[0]), 
                                                                 str(item[1])) for item in keywordArgs)

                    traceStr+="%-10sPositional Arguments : %s\n"%('',prettyArgsString)
                    traceStr+="%-10s*Variable Aruguments : %s\n"%('',prettyVarArgsString)
                    traceStr+="%-10s**Keyword Arguments  : %s\n"%('',prettyKeywordArgsString)
                    traceStr+= '\n'
                log(traceStr)
            log('\n')
            log('*'*50 + ' BackTrace END ' + '*'*50+'\n\n')
            f(*args, **kwargs) #Call the function
        return wrapped_f
    return wrap



#################################
## Example Usage
## Output
"""
#0   func3 at __main__
#1   func2 at getBackTrace:67
          Positional Arguments : i=10000
          *Variable Aruguments : 1230, 12345
          **Keyword Arguments  : oi=20

#2   func1 at getBackTrace:64
          Positional Arguments : i=100
          *Variable Aruguments : 
          **Keyword Arguments  : 


"""
def func1(i):
    func2(i**2, 1230, 12345, oi=20)

def func2(i, *args, **kwargs):
    func3(i**2)

#@getBackTrace()
@getBackTrace(printArgs=True)
def func3(i, **kwargs):
    pass

if __name__ == '__main__':
    func1(100)

If you are thinking of some other Nice-to-have features, The code is just one Fork away, or give me a shoutout in the comments below.

Posted in code, hack | Tagged , , , , , , , , | Leave a comment

notify.sh: Get notified when shell scripts go wrong.

TLDR:

Project Summary: Can be used to automatically suspend execution of a shell script if runtime error occurs, followed by notifying the user who started the script. The suspended script can be easily resumed later by the user. This is particularly helpful in a scenario where you are running a script which does lot of work and you don’t want to do the cleanup and start the script again in case a runtime error occurs. In such cases script suspend/resume functionality can save you a lot of time.
Project Page: Notify.sh

I got the idea for this project a few weeks back, while running a large-scale DB migration on production servers at my day job, on how to handle the process errors better. Though I had tested migration process multiple times on all the possible things that could wrong, I couldn’t just ignore Mr. Murphy completely.

I also had few other rather obvious things to take care of like, website downtime required if any must be least possible, and I also was given relatively short time for process to be completed. This means if a problem occurs after the process has started I probably won’t be able to re-run the process from scratch, and would have to wait for next time window to schedule the migration process.

But there was a catch, if I could just suspend the process if/when the error occurs, then I could see what caused the error and I can just resume the process , instead of having to restart the process from scratch..
The task consisted mostly of – executing SQL and shell scripts in a Linux environment.

The main script(run.sh) had the following format –

<pre-processing/ script init routines>
<sql_statement_1>
<sql_statement_2>
<sql_statement_3>
…..
<post-processing/ process validation routines>

The process was originally designed, in a way that process(or a single step) failure wouldn’t cause any unintended side effects, similar to the ALL or NONE transactional philosophy of databases. So I only was optimizing on the not-having-to-restart the process part if some error occurs.

I thought of adding a check step after every SQL statements, which would check what happened to last executed statement and will suspend the shell script if previous step had an error. My script format after this change will be –

<pre-processing/ script init routines>
<sql_statement_1>
Check state of sql_statement_1
<sql_statement_2>
Check state of sql_statement_2
<sql_statement_3>
Check state of sql_statement_3
…..
<post-processing/ process validation routines>

After deciding on to the what part, I had the problem of how and for following questions.

  1. How to check status of the last executed step in a shell script?
  2. How a running shell script could  suspend itself?
  3. How to notify user of the error situation, also providing the necessary information?
  4. How the suspended process can be resumed later by the user?

My solutions were as follows:

     

  1. How to check status of the last executed step in a shell script?
  2. Two words – “Exit Status”. And the fact that exit status of the last executed command in Linux is stored in the bash variable – $?

  3. How a running shell script can be able to suspend itself?
  4. This was an interesting question, and the answer surprised me, Named Pipes. I read about it long time back and found it interesting back then, but wasn’t able to find any use-case for it until todal. The basic idea is that
    the process about to get blocked will create a named pipe with a unique name(*Process ID) and will be at the receiving end of the pipe. This will suspend (*Cough* I/O block) the process until it gets the confirmation message (through the named pipe) to resume from the user.

    This entire process looks something like this:

    Process Commnuncation

  5. How to notify user about the error situation, providing the necessary information?
  6. Sending an e-mail to the user with process information and instructions to resume the process would do just fine. Another BIG plus is, you need not monitor script status continuously for any errors and will automatically get a notification e-mail when/if an error occurs.

  7. How to resume the suspended process later?
  8. User just has to send the string “RESUME” though the named-pipe to the blocked process, which will then break out of the I/O block and continue with the next step. He will also get the instructions on how to do this as a part of notification e-mail sent.

By the combined awesomeness of python+Linux, It took me one evening to code this entire project. I decided to release the code into open-source later, and was given the permission to do so, thanks, Serendio. 🙂
The project can be accessed here. So enjoy the code and if you can think of ways to make code better/ or more features. notify.sh is just one fork away.

Posted in code, hack | Tagged , , , , , , | Leave a comment

Python Global Interpreter Lock (GIL) explained – Pycon Tech Talk

Today I saw a amazing tech talk on the much debated topic of python Global Interpreter Lock a.k.a GIL. Python’s Global Interpreter Lock or GIL is probably one of the least properly understood parts of python, but is extremely important to know of if you are dealing with any form of multi-threaded python code.
The talk debunks several GIL myths, explains GIL/Interpreter internals and introductory talk about the better GIL implementation coming up in version 3.2 of python. Apart from the highly informative talk I liked speaker’s way of approaching the rather subtle topic in a manner that even novice programmers could relate to, showing interpreter internal code as required, demos showing GIL influencing overall program performance in cases when multi-core systems, running multiple threads etc are involved, and last but not the least the occasional geek humor.

Few takeaways from the Talk:

  1. Python threads are native threads (POSIX threads on Unix) and not Green threads.
  2. There can only be one thread running at any given time in a python process regardless of whether you are using multiple threads, due to the way GIL works, independent of the OS multi-threading support. In other words only one thread can have mutual exclusive access to GIL at any given time.
  3. Splitting a process into multiple threads won’t speed up the execution, On the contrary it might degrade overall performance as now these threads will compete for GIL access.
  4. If you are running multiple threads on a multi-core system, performance might be even worse than compared to single core machine, as OS now supports running multiple thread simultaneously, but GIL will still execute only one thread. So OS will try resuming suspended  threads which will then try to get hold of the GIL, which will fail because some other thread is already using GIL. In other words there will be lot of False alarms, which will cause thrashing leading to overall degraded performance.

While you should not worry about your multi-threaded code having performance issues because of the GIL, more often than not it’s because of badly written code. However as a general rule of thumb if you don’t need  to synchronize threads/shared resources access but are only concerned about reducing process execution time, you are probably better off using multiple processes instead of multiple threads. See Multiprocessing for further information on how multiprocessing can replace multi-threading for this case.

Posted in Tech talks, Video | Tagged , , , , , , | Leave a comment

Workaround for google app engine’s – File not accessible error using virtualenv

Google App Engine comes really handy for devs who want to host their application idea/hack on google’s infrastructure without having to spend any $$ for it (and the resource limits are pretty descent for free account). No wonder it is a really useful tool that every developer/hacker should know of.

Few days earlier while trying to implement some pet project idea of mine at google app engine, I got a strange long exception(probably sandbox violation error) when I tried to access the running app engine application in my browser.

Here is the truncated traceback message (App engine internal function calls removed)

WARNING 2010-09-28 19:13:59,854 py_zipimport.py:103] Can't open zipfile /usr/local/lib/python2.6/dist-packages/pyExcelerator-0.6.0a-py2.6.egg: IOError: [Errno 13] file not accessible: '/usr/local/lib/python2.6/dist-packages/pyExcelerator-0.6.0a-py2.6.egg'

........

File "/home/ashish/Desktop/google_appengine/google/appengine/tools/dev_appserver.py", line 2188, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/home/ashish/Desktop/google_appengine/shuffler/shuffle.py", line 17, in
import cgi

.......

 

File "/home/ashish/Desktop/google_appengine/google/appengine/tools/dev_appserver.py", line 1186, in __init__
raise IOError(errno.EACCES, 'file not accessible', filename)
IOError: [Errno 13] file not accessible: '/usr/local/lib/python2.6/dist-packages/cgi.unescape-0.0.2-py2.6.egg'
INFO 2010-09-28 19:13:59,865 dev_appserver.py:3246] "GET / HTTP/1.1" 500 -

I had no idea why I got this error as cgi module was available and even import cgi worked on python interpreter prompt. It looked like some kind of sandbox error as I knew google app engine runs application in a security sandbox. Googling for the error didn’t do much help. Maybe I screwed up directory permissions on my *nix system, or maybe the external python libraries present on my system are giving app engine a hard time.

I remembered from some time ago while working on Pylons, which is another nice pluggable python based web framework, that it recommended installing virtualenv as a part of pylons installation step.

virtualenv is a tool to create isolated Python environments. It comes quite handy when one want to install a python library/framework, while avoiding any side-effects to the existing python installation.

As a formal disclaimer, While this process should do no harm, still proceed with caution. Also Note that this has been ONLY tested with python version 2.6 on Ubuntu 10.04 OS.

sudo easy_install virtualenv==1.4.9 (Or if you prefer apt-get install or pip)

Assuming that :
1) You have python version 2.6 installed.
2) You want to install this separate python installation on ~/GAE_PYTHON directory.

Create the new python environment by running
virtualenv -v -p python2.6 –no-site-packages ~/GAE_PYTHON_ENV

Assuming that last step completed successfully.

Now run this command to switch to the new created python environment –
source ~/GAE_PYTHON_ENV/bin/activate

You should now see your shell prompt prefixed by (GAE_PYTHON_ENV) (Note: I Use bash shell)

If the process so far completed without any errors, try running any of the demo application that comes with Google APP engine (While remaining in the same environment ) to quickly check if the new setup works. I tried guestbook demo application that ships with app engine’s setup and was able to use it without any difficulties.

Note that now this setup doesn’t have access to python libraries that you may have installed on your base python setup, so you may need to install required libraries for this setup separately. You should be able to install any library in the usual way using easy_install or pip while working from within the new environment.

While this will probably work for the system I have described above, if you happen to have a different system setup, try looking for specific installation instructions on how to set up virtualenv on your system.

Posted in code, troubleshooting | Tagged , , , , , , | Leave a comment

Find Kth minimum element in a unsorted array

Some days ago while interviewing a candidate, I asked the candidate a very easy by the book interview question.

Design an efficient algorithm to find the 2nd smallest integer in a unsorted integer array of length N, which should run in constant space.

This question is quite easy to solve, and google search returns tons of results for the question.
In brief one way in which the above problem can be solved is, to store both the smallest element and the 2nd smallest element while traversing the array.This approach will require array to be traversed only once .

After the interview was finished I thought of the generalized variant of the above problem, which is  –

Design an efficient algorithm to find the Kth smallest integer in a unsorted integer array of length N which should run in constant space.

As one can see using the same approach we used for k=2, while will be a valid approach(Space complexity – O(k)), It will not be efficient for cases in which K ~ N/2 (Think!!!)

I thought whether using a appropriate data structure might work here –

Heap : O(nlg(n)), Building the Heap for n elements will take O( nlg(n) ) time.
Binary search tree: Building the tree will take O(n lg(n)) time.

So, as you can see I could as well sort the array in O(n lg(n) ) time and return the element at Kth index.
Lets see if can do better.

Hint: We can make use of partition step of QuickSort, to find the kth minimum element.
Think about it, or read further for the details/implementation code.

Step1: Partition the Array.
So if you can remember partitioning step in quicksort algorithm (if not, read below for a quick overview)

From the input array select a random element, and call it the pivot element.
Partition the array in such a way that
All the array elements which are less than or equal to pivot are on left side of the pivot.
All the array elements which are more than pivot are on right side of the pivot.

And Now the meat of the algorithm:

lets say that pivot_idx is the final position of the pivot after partitioning.
That means that all elements values less than pivot are placed before it
Which means that pivot is Kth smallest element for the array for k = pivot_idx.

Step 2: Repeat Step 1 based on some conditions(Divide array into smaller sub-array)
It comprises of 3 simple steps.

If k = pivot_idx .return pivot value as the answer.
if k < pivot_idx. the answer lies in left partition and so partition existing left partition.
if k > pivot_idx. the answer lies in right partition and so partition existing right partition.

Algorithm complexity analysis
Though this approach have a worst case complexity as Quick Sort, because it reuses its partition (Divide) step.
On the positive side, it also gives good average case time complexity, similar to quicksort.
I can also replace existing naïve partitioning algorithm with a more efficient version, which most of the times can result in partition divide by half.
In that case, the number of comparisons I have to do will be:
n + n /2 + n/4 + n/8 … (A geometric progression with ratio 0.5) = 2n .
So the algorithm is not so bad after all 🙂

And as always, Algorithm’s implementation in python (Download Code)

#!/usr/bin/env python
from random import randrange

def findKMin(arr,k,start=0,end=None):
    '''
    Find kth minimum element in a array (in-place randomized algorithm, similar to quicksort)
    assumption: Input will only contain unique elements'''
    if k > len(arr):
        raise Exception("k should be less than length of the input array")
    if not end: end = len(arr) -1 #Get last index value
    pivot_ridx = randrange(start,end)     #Get a random array element as pivot value
    pivot = arr[pivot_ridx]
    pivot_idx = partition(arr,start,end,pivot_ridx) #partition to partition array around the pivot value in place
    if pivot_idx+1 == k:
        return pivot #Well, there is your answer
    elif pivot_idx+1 > k:
        return findKMin(arr,k,start,pivot_idx) #lies somewhere in the first partition
    else:
        return findKMin(arr,k,pivot_idx,end) #lies somewhere in the second Partiton

def partition(arr,start,end,pivot_idx):
    '''
    Partitions array in-place around the given pivot value
    '''
    pivot = arr[pivot_idx]
    arr[end],arr[pivot_idx] = arr[pivot_idx],arr[end]
    inc_idx = start
    for i in xrange(start,end):
        if arr[i] <= pivot:
            arr[inc_idx],arr[i] = arr[i],arr[inc_idx]
            inc_idx+=1
    arr[end],arr[inc_idx] = arr[inc_idx],arr[end]
    return inc_idx

if __name__ == '__main__':
    from random import shuffle
    test_input = range(100000)
    shuffle(test_input)
    assert findKMin(test_input,50000) == 49999
Posted in Algorithm | Tagged , , , , , , | 4 Comments

Generating all possible permutations of a sequence.

P.S. – This Post is intended for people who want to know a simple,efficient approach to generate all possible permutations for a sequence.For real world software I would recommend using standard library functions.

All of us at some moment while coding have faced the problem on how to generate all possible permutations of a sequence.
In this case algorithm’s efficiency becomes highly important. Use a recursive solution and for even moderately sized sequence, running time (and memory) goes though the roof.
Recently I came across Donald knuth’s elegant linear time non-recursive approach. I wanted to see algorithm in action, so I implemented it and thought it would be nice to write a blog post about it.

So Without further ado, I have given below a simple explanation of algorithm working, and code implementation in both java,python.

Algorithm Summary
The idea is to generate permutations of the sequence so given, in lexicographical increasing order.

for ex. for an array [1,2,4,3]
Resulting Permutations would be [1,2,3,4], [1,2,4,3] …. [4,3,2,1]
One approach on how to generate this ordered permutation can be read from its  wikipedia entry.

Python implementation (Download code):


 def getNextPermute(arr):
     """
     Implementation of http://en.wikipedia.org/wiki/Permutation#Systematic_generation_of_all_permutations
     Algorithm to effeciently generate permutations of a sequence
     returns a generator type object, which for each call return next possible permutation
     until all possiblities are exhausted
     """
     arr.sort()
     yield arr[:]
     while True:
         for j in xrange(len(arr)-2,-1,-1): #find larget a[j]
             if arr[j+1] > arr[j]:
                 break
         else:
             return

         for l in xrange(len(arr)-1,j,-1): #find larget a[j]
             if arr[l] > arr[j]:
                 break

         arr[j],arr[l] = arr[l],arr[j] #swap a[j],a[l]
         arr[j+1:] = arr[j+1:][::-1] #reverse array present after the index (j+1)th
         yield arr[:]

 if __name__=="__main__":
     assert len(list(getNextPermute([5,1,2,3,4]))) == 120
     for each in getNextPermute(["a","b","c"]):
         print ','.join(each)

Java Implementation (Download Code):




import java.util.*;

public class Permute<E>
{
  /**
   *    Implementation of http://en.wikipedia.org/wiki/Permutation#Systematic_generation_of_all_permutations                              
          Algorithm to effeciently generate permutations of a sequence                                                                                                                   
          until all possiblities are exhausted                    
           
   */
  private int[] arrIdxs;
  private ArrayList<E> arr;
  public ArrayList<E> get_next()
  {
    ArrayList<E> ret = new ArrayList<E>();
    for (int idx = ; idx < arrIdxs.length;idx++)
      ret.add(idx,arr.get(arrIdxs[idx]))//Permute integer based array indexes, which can be used to get permuted array in return
    return ret;
  }
  public Permute(ArrayList<E> arr)
  {
    this.arr = new ArrayList<E>();
    for (E each:arr)
      this.arr.add(each);
    arrIdxs = new int[arr.size()];
    for(int i = ; i < arr.size();i++)    //Set indexes lexicographically minimal value;
      this.arrIdxs[i]= i;
  }

  public boolean next_permutation()
  {
    int i,j,l;

    for(j =arr.size() -;j >=0  ; j--//get maximum index j for which arr[j+1] > arr[j]
      if (arrIdxs[j+1> arrIdxs[j])
        break;
    if (j == -1//has reached it's lexicographic maximum value, No more permutations left 
      return false;

    for(l = arr.size()-1;l > j ; l--//get maximum index l for which arr[l] > arr[j]
      if (arrIdxs[l> arrIdxs[j])
        break;
    
    int swap = arrIdxs[j]//Swap arr[i],arr[j] 
    arrIdxs[j= arrIdxs[l];
    arrIdxs[l= swap;

    for (i = j+1;i < arrIdxs.length;i++//reverse array present after index : j+1 
    {
      if (i > arrIdxs.length - i + j )
        break;
      swap = arrIdxs[i];
      arrIdxs[i= arrIdxs[arrIdxs.length - i + j];
      arrIdxs[arrIdxs.length - i + j= swap;
    }

    return true;
  }    
  public static void main(String[] args)
  {
    /**
     * Test Run
     */
    ArrayList<String> test_arr = new ArrayList<String>();
    test_arr.add("ab");test_arr.add("b");test_arr.add("c");test_arr.add("d");
    Permute<String> test = new Permute<String>(test_arr);
    while (true)
    {
      System.out.println(test.get_next().toString());
      if (!test.next_permutation())
        break;
    }
  }
}

Java2html


Posted in Algorihtms | Tagged , , , , , , | 2 Comments

Richard Feynman on Ways of Thinking

I am a big fan of  Richard Feynman.

He has this unique ability of explaining what happens around us daily in a way as fascinating and interesting as they can be. He makes us failing in love with science (not that science is not lovable already ;)).
No wonder he is treated as rockstar in the field of science, a title only hold by Einstein before him.
I wish I had a teacher explaining stuff like he does while I was still in school.. 😐

Recently I came across two such fascinating youtube videos in which he talks about ways in which people think (via HackerNews)

Enjoy watching these videos 🙂

Posted in People | Tagged , , , | Leave a comment

How Google News Recommendation Engine Works!! [Video]

I just finished watching a Great Tech-Talk on how Google news recommends relevant news items one might be interested in and
I thought it would be nice to share this video on my blog.

Please note the talk was given in 2008, but is quite informative None the less.

Some Takeaways:

They mainly use collaborative algorithms algorithms family for coming up with stories to recommend.
Using this not only can they leverage google’s large user base that they have to their advantage, but it also makes algorithms domain independent, so the same algorithms can also be used to some other domain like google video, etc.

They use weights on various applicable methods to come up with a net weighted score for  all the stories.

Some of the factors influencing news recommendation:

  1. Stories which users who have similar interest as yours have clicked [MinHash,PLSI]
  2. Stories that you have co-visited with the stories you clicked.

Recommendation are category specific, so your news clicks in entertainment section won’t affect items recommended in technology section.

Author also talks about how to make these algorithms scalable enough to run them effectively,  which is mostly achieved by using Map-Reduce , a well known distributed computing standard.

To people genuinely interested on this topic, I would suggest reading another great book on the subject – programming Collective Intelligence: Building Smart Web 2.0 Applications
A great read for the researchers and the newbies alike.

Posted in Technology | Tagged , , , , , , | Leave a comment

PostgreSQL tips and tricks !! (Part 1)

For a long time I wanted to have an online catalog of various sql nifty queries and hacks that i have found helpful and continue to use till date .
Many times  I got help from various blogs written on this topic.
Now It’s my time to give back something to the community, I learned so much from :).

This post is most useful for people who have intermediate to advanced knowledge of postgresql/SQL.

Note : Please use these queries at your own risk, hopefully it should help you get your work done without any hiccups 🙂

1) Using query aliases in table joins for better performance:
While doing a join between two (or more ) tables, using aliases gives gives performance gain over ordinary joins in certain cases. Like for ex.
Suppose there are two tables having Bank information (bankinfo) and Transaction Information (trinfo) , They share a common column named IFSC number.
I need Bank Information,and total number of transactions present for each bank. Now I can write needed sql query in two ways.

select bankinfo.name,trinfo.ifsc,count(*) from bankinfo left outer join trinfo on bankinfo.ifsc=trinfo.ifsc group by bankinfo.name,trinfo.ifsc;

Or using query aliases

select bankinfo.name,q.count from bankinfo left outer join (select ifsc,count from trinfo group by ifsc)q on bankinfo.ifsc = q.ifsc;

You can see that we are reducing rows that would need to be joined, which will most of the times will result in improved performance.This mostly works if join involves one of the tables which is relatively large in size or if you are using group by clauses.

Continue reading

Posted in Hacks, Technology | Tagged , , , , , | 1 Comment

Python Wrapper for Google Translate AJAX API.

Few Days Ago, i needed some Non English documents to be translated to English language, for which Google AJAX Language API can be used.

I wrote a small script which got the work done ,and why put it to waste when there is a chance that it can be used by others.

So without further ado, here is the code :), Feel free to use it in any commercial/open source projects

 #By Ashish Yadav, released under WTFPL
 #http://sam.zoy.org/wtfpl/
 # -*- coding: utf-8 -*-

 import sys
 import urllib
 import simplejson

 baseUrl = "http://ajax.googleapis.com/ajax/services/language/translate"

 def getSplits(text,splitLength=4500):
     ''' 
     Translate Api has a limit on length of text(4500 characters) that can be translated at once, 
     '''
     return (text[index:index+splitLength] for index in xrange(0,len(text),splitLength))


 def translate(text,src=' ', to='en'):
     '''
     A Python Wrapper for Google AJAX Language API:
     * Uses Google Language Detection, in cases source language is not provided with the source text
     * Splits up text if it's longer then 4500 characters, as a limit put up by the API
     '''

     params = ({'langpair': '%s|%s' % (src, to),
              'v': '1.0'
              })
     retText=''
     for text in getSplits(text):
             params['q'] = text
             resp = simplejson.load(urllib.urlopen('%s' % (baseUrl), data = urllib.urlencode(params)))
             try:
                     retText += resp['responseData']['translatedText']
             except:
                     raise
     return retText


 def test():
     msg = "      Write something You want to be translated to English,\n"\
         "      Enter ctrl+c to exit"
     print msg
     while True:
         text = raw_input('#>  ')
         retText = translate(text)
         print retText


 if __name__=='__main__':
     try:
         test()
     except KeyboardInterrupt:
         print "\n"
         sys.exit(0)

Python Code Formatting is done using the awesome Python2Html utility

Also present as a recipe on Activestate

Posted in Uncategorized | Tagged , , | 1 Comment