Thursday, September 3, 2015

H-Index [LeetCode]

Problem Description
Given an array of citations (each citation is a non-negative integer) of a researcher, write a function to compute the researcher's h-index.
According to the definition of h-index on Wikipedia: "A scientist has index h if h of his/her N papers have at least h citations each, and the other N − h papers have no more than h citations each."
For example, given citations = [3, 0, 6, 1, 5], which means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively. Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, his h-index is 3.
Note: If there are several possible values for h, the maximum one is taken as the h-index.
Solution

As we know, H-Index is a measure of productivity and citation impact of a researcher. It is also called Hirsch index or Hirsch number after the name of the physicist Jorge Hirsch.
Picture 1: H-Index for decreasing citations array (Credit: Wiki)

Supposed that citations is an array of numbers of citations of a researcher. We can easily see that the following formula holds if citations is sorted in decreasing order:

h-index of citations = max (min (citations[i], i) for all i from 1 to number of papers)

Therefore, we can easily implement the calculation of h-index by sorting the array in decreasing order then applying the above formula. Below is the 2-line Python code in O(nlogn) time.
class Solution(object):
    def hIndex(self, citations):
        """
        :type citations: List[int]
        :rtype: int
        """
        citations.sort(reverse=True)
        return max([min(k+1, v) for k,v in enumerate(citations)]) if citations else 0
By sorting, it takes O(nlogn) time complexity.
However, we can do much better by an O(n) time algorithm. Pause yourself for a few minutes to think about it before continue reading ^_^.

We have some simple observation here, but it really helps to improve the performance. For each number h from 0 to n where n is the number of papers, we need to find out how many citations of a paper equal h called equal_h. Based on this, we can find the number of citations values that is at least h, and no more than h. To find the number of citations value that is at least h, we take sum of equal_h[i] for i from h to n!!! And similarly, to find the number of citations values that is no more than h citations each, we can sum up equal_h[i] for i from 0 to i. And we can find the h-index by simple running from the back of the array equal_h.
So if running from the back of the array equal_h , and h is the first index that satisfies the following conditions:
equal_h[h] + equal_h[h+1] + ... + equal_h[n] >= h (*)
equal_h[0] + equal_h[1] + ... + equal_h[h] >= N-h (**)
Then h is what we are looking for.
However, we have:
equal_h[0] + equal_h[1] + ... + equal_h[n] = n
Therefore:
equal_h[0] + equal_h[1] + ... + equal_h[h] = n- (equal_h[h+1] + ... + equal_h[n])
Another note is that since h is the first element satisfies the 2 above conditions, then h+1 does not satisfies one of them, meaning that either
(1): { equal_h[h+1] + equal_h[h+2] + ... + equal_h[n] <= h
or (2): { equal_h[h+1] + equal_h[h+2] + ... + equal_h[n] >= h+1
and equal_h[0] + equal_h[1] + ... + equal_h[h+1] < N-(h+1) }
(1) suggests that:
equal_h[0] + equal_h[1] + ... + equal_h[h] >=N-h which is (**)
(2) suggests that:
equal_h[h+2] + equal_h[h+3] + ... + equal_h[n] >= h+2
This inequality will be repeated until equal_h[n] >= n , which is wrong.

So all we need is to find the first h satisfies the condition (*), and we do not need to check the condition (**).
Below are the codes implement in different languages.

I. Python code - O(n) 
class Solution(object):
    def hIndex(self, citations):
        """
        :type citations: List[int]
        :rtype: int
        """
        n = len(citations)
        equal_h = [0] * (n+1)
        for h in range(n):
            if citations[h] >= n: equal_h[n] += 1
            else: equal_h[citations[h]] += 1
        
        s = 0
        for h in range(n,0, -1):
            s += equal_h[h]
            if s>=h:
                return h
            
        return 0
II. Java Code - O(n)
public class Solution {
    public int hIndex(int[] citations) {
        int n = citations.length;
        int [] equal_h = new int[n+1];
        for (int h = 0; h<n; h++){
            if(citations[h] >= n) equal_h[n] += 1;
            else equal_h[citations[h]] += 1;
        }
        int s = 0; //we don't need check overflow here coz sum always <= n
        for (int h = n; h>0; h--){
            s += equal_h[h];
            if (s >= h) return h;
            
        }
        return 0;
    }
}

No comments:

Post a Comment