Need ideas to add to grid-based project

rapt0r

Adept
I am doing a college project on grid computing. Currently I have made server/client UIs. Nodes run on different machines, and there is a client which connects to these nodes and distributes processing tasks among them. I have implemented primality testing of very large numbers (tested 800-900 digits, and it can go higher with more processing power). You enter a range in which you want to find prime numbers, the client program divides the range and sends the pieces to the nodes, which send back the results ;). Other things implemented are factorial calculations of huge numbers and calculating pi (doh!) for benchmarking the nodes. Any suggestions/tips on what more I should add would be great. Basically I need more tasks that can be easily distributed, so that I can show that adding more nodes reduces the total required processing time.
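
Here is a rough single-machine sketch of the split-and-distribute idea (not the actual project code): multiprocessing workers stand in for the remote nodes, Miller-Rabin does the primality test, and names like split_range/test_chunk are just illustrative.

```python
# Sketch of the range-splitting scheme: the "client" splits [lo, hi) into
# chunks, each "node" (here just a worker process) tests its chunk, and
# the results are merged at the end.
import random
from multiprocessing import Pool

def is_probable_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def split_range(lo, hi, parts):
    """Divide [lo, hi) into roughly equal chunks, one per node."""
    step = (hi - lo + parts - 1) // parts
    return [(lo + i * step, min(lo + (i + 1) * step, hi)) for i in range(parts)]

def test_chunk(bounds):
    lo, hi = bounds
    return [n for n in range(lo, hi) if is_probable_prime(n)]

if __name__ == "__main__":
    chunks = split_range(10**6, 10**6 + 10**4, parts=4)
    with Pool(4) as pool:                      # stand-in for 4 remote nodes
        primes = [p for part in pool.map(test_chunk, chunks) for p in part]
    print(len(primes), "primes found")
```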
 
Hi,

Nice to see some work on distributed systems. :) Things that immediately come to mind are those that deal with information retrieval or archiving.

1. Archiving?

Perhaps volunteer to convert the LaTeX/TeX/doc files at your university to a web or PDF format. Here is an example of how the New York Times converted 11 million articles to PDF in under 24 hours, using 100 instances in a distributed environment.

I then began some rough calculations and determined that if I used only four machines, it could take some time to generate all 11 million article PDFs. But thanks to the swell people at Amazon, I got access to a few more machines and churned through all 11 million articles in just under 24 hours using 100 EC2 instances, and generated another 1.5TB of data to store in S3. (In fact, it worked so well that we ran it twice, since after we were done we noticed an error in the PDFs.)
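
A hedged sketch of how such a batch conversion could be fanned out (assuming pdflatex on the PATH; on a grid the worker pool would be remote nodes rather than local processes, and the input directory here is hypothetical):

```python
# Sketch: fan .tex files out to a worker pool, each worker shelling out
# to pdflatex. Swap the Pool for grid nodes in a real deployment.
import glob
import subprocess
from multiprocessing import Pool

def convert(tex_path):
    # -interaction=nonstopmode keeps one bad file from blocking the batch
    result = subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", tex_path],
        capture_output=True)
    return tex_path, result.returncode == 0

if __name__ == "__main__":
    files = glob.glob("archive/*.tex")    # hypothetical input directory
    with Pool(8) as pool:
        for path, ok in pool.imap_unordered(convert, files):
            print("OK " if ok else "FAIL", path)
```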

2. MapReduce applications

Although MapReduce can be spawned as parallel processes on a single machine, it is ideally suited to extending across distributed nodes/machines: each node performs a map function on a portion of the data set, then a reducer takes all of the computed results and combines them, hence the name MapReduce. IMHO, the results you see when you enter a search query in Google are generated through 25-30 MapReduce operations distributed over 1000 or more machines.
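
To make the two phases concrete, here is a minimal single-machine sketch of the map -> shuffle -> reduce flow using the canonical word-count example; a real Hadoop or Google MapReduce job distributes these same phases across many machines.

```python
# Word count as map -> shuffle -> reduce, all on one machine.
from collections import defaultdict

def map_phase(doc_id, text):
    # map: emit a (key, value) pair for each word occurrence
    for word in text.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # reduce: combine all values that share a key
    return word, sum(counts)

def mapreduce(documents):
    shuffled = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            shuffled[key].append(value)        # shuffle: group by key
    return dict(reduce_phase(k, v) for k, v in shuffled.items())

docs = {"a": "the quick brown fox", "b": "the lazy dog and the fox"}
print(mapreduce(docs))   # {'the': 3, 'quick': 1, ...}
```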

Sample applications from MapReduce implementations like Hadoop and Google's MapReduce show plenty of scope for the same.

Other areas where Google suggests MapReduce are:

* distributed grep

* distributed sort

* web link-graph reversal

* term-vector per host

* web access log stats

* inverted index construction (see the sketch after this list)

* document clustering

* machine learning
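
As an illustration of one item above, inverted index construction maps cleanly onto the same two phases. A hypothetical sketch, mirroring the word-count structure:

```python
# Inverted index construction as map + reduce: map emits (word, doc_id)
# pairs, reduce collapses each word's doc_ids into a posting list.
from collections import defaultdict

def map_doc(doc_id, text):
    # map: emit a (word, doc_id) pair for every word occurrence
    for word in text.split():
        yield word.lower(), doc_id

def reduce_postings(word, doc_ids):
    # reduce: collapse all doc_ids for a word into a sorted posting list
    return word, sorted(set(doc_ids))

def build_inverted_index(documents):
    shuffled = defaultdict(list)
    for doc_id, text in documents.items():
        for word, d in map_doc(doc_id, text):
            shuffled[word].append(d)           # shuffle: group by word
    return dict(reduce_postings(w, ds) for w, ds in shuffled.items())

docs = {"a": "grid computing", "b": "grid nodes compute"}
print(build_inverted_index(docs))  # {'grid': ['a', 'b'], ...}
```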

Other Hadoop applications that have been documented include:

* Aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences.

* Analyze and index textual information

* Analyzing similarities of users' behavior.

* Build scalable machine learning algorithms like canopy clustering, k-means and many more to come (naive bayes classifiers, others)

* Charts calculation and web log analysis

* Filtering and indexing listing, processing log analysis, and for recommendation data.

* Flexible web search engine software

* Gathering world wide DNS data in order to discover content distribution networks and configuration issues

* Generating web graphs

* Image based video copyright protection.

* Image processing environment for image-based product recommendation system

* Image retrieval engine

* Large scale image conversions

* Latent Semantic Analysis, Collaborative Filtering

* Log analysis, data mining and machine learning

* Natural Language Search

* Parses and indexes mail logs for search

* Process whole price data user input with map/reduce.

* Recommender system for behavioral targeting, plus other clickstream analytics

* Run Naive Bayes classifiers in parallel over crawl data to discover event

* Serve large Lucene indexes

* Session analysis and report generation

* Source code search engine

* Statistical analysis and modeling at scale.

* Storage, log analysis, and pattern discovery/analysis.

* etc

At my startup, we're running a 4-node cluster ourselves that we use for a multitude of things: web servers for load-balancing traffic, simulating high loads, running a distributed web crawler, a distributed database, creating thumbnails, etc. We're using Erlang; maybe you can have a look at that as well.
 
bosky, looks like you're posting about data grids, and raptor is trying a compute grid.

raptor: I wonder if doing a ray tracer, to get a clear visual impact of the render speedup as multiple nodes kick in, is too much of an effort... you could search for open source :D
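
Something like this, roughly: every pixel is independent, so you can split the image into horizontal bands, render each band on a separate node, and stitch the results (trace_pixel here is just a placeholder for real ray casting):

```python
# Why ray tracing distributes so well: pixels are independent, so the
# image splits into bands that render in parallel and merge in order.
from multiprocessing import Pool

WIDTH, HEIGHT = 320, 240

def trace_pixel(x, y):
    # placeholder shading; a real tracer would cast a ray into the scene
    return (x * 255 // WIDTH, y * 255 // HEIGHT, 128)

def render_band(rows):
    y0, y1 = rows
    return [[trace_pixel(x, y) for x in range(WIDTH)] for y in range(y0, y1)]

if __name__ == "__main__":
    bands = [(i * HEIGHT // 4, (i + 1) * HEIGHT // 4) for i in range(4)]
    with Pool(4) as pool:                 # stand-in for 4 render nodes
        image = [row for band in pool.map(render_band, bands) for row in band]
    print(len(image), "rows rendered")
```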

calculating/testing primes isn't very glamorous, is it? heh
 