Introduction
The digital exchange of information over distributed networks is not a new topic. Applications such as classroom educational tools, chat services, multicast applications, and, more commonly, electronic mail can all be categorized as variations on this design. The now infamous birth of Napster as a quick and dirty way to access music files reinvigorated industrial and academic activity in this arena. Central to this development work have been issues such as disk space utilization, bandwidth constraints, the effectiveness of peer discovery, group management, and single points of failure. The fruit of these efforts has been second-generation services such as Gnutella, Kazaa, JXTA, and Pastry, all of which offer completely decentralized methods of file sharing.
A common trait of these systems is that each establishes its own foundation, leaving the application development community much flexibility to build products and services on top, on the assumption that the foundation provided is both efficient and reliable. In this blog, we look into the foundation laid by centralized networks, the effectiveness of that architecture during content discovery, and the role it plays in current peer-to-peer systems.
Given the popularity of this area of development, there seem to be as many protocols as there are claims to a solid solution. For example, Gnutella's flexible network design dramatically decreases the possibility of a single point of failure, but as the system scales, its method of flooding the network with queries eventually does more harm than good. Another example is Pixie, a peer-to-peer architecture that uses content scheduling to ease the limitations imposed by network utilization: rather than flood the network with requests, the application schedules content delivery around the efficient use of resources. Finally, there are the recently popular hybrid networks - a combination of the decentralized and centralized architectures of systems such as Gnutella and Napster respectively.
With this latest evolution in networking approaches in mind, the goal of this project is to design a small centralized network and test how efficiently that network performs search, discovery, and retrieval under varying conditions. Since Napster's introduction, central indexing services, as a scalable solution for content delivery, have been largely replaced by decentralized systems that are less vulnerable to a single point of failure. However, as both research and industrial efforts continue to address the shortcomings of centralized and decentralized networks alike, it is becoming increasingly apparent that a combination of the two is (currently) the best solution. This is the stuff the KaZaA file-sharing network is made of: centralized servers are located throughout decentralized peer groups, combining the fast access of a central index with the request-propagation strengths of a decentralized solution.
In large part due to the renewed utility of centralized search methods within hybrid solutions, this project sets out to explore how centralized search and retrieval is accomplished and, if need be, where it can be improved. The four primary system features considered in this project are:
Node Discovery. One of the major challenges in peer-to-peer systems design is the discovery of nodes on the network. Each individual node needs a reliable method for discovering and handshaking with every other node. Additionally, information about the various nodes on the network needs to be stored in some manner. In an indexing system such as this one, a central server is used to maintain information about the nodes on the network. Each individual node must log onto the network, register with the central server, and query the server's registry to discover other nodes on the network.
Content Discovery. Content discovery is the location of files on the network. In centralized systems, this can be directly from one peer to another. In our system, the directory server acts as the adjudicating element in the network, directing the local client's requests. Specifically, requesting peers are given IP references to nodes on the network with requested files. Of interest to me in this project is the time it takes between content request and node discovery.
Content Delivery. Once the server, remote hosts, and content locations are discovered, there needs to be an efficient mechanism for delivering files over the network. In many peer-to-peer systems, this piece of the puzzle is key to the overall success of the design. For centralized architectures in particular, content delivery can become an electronic thorn relative to the number of nodes on the network: the more hosts requesting content, the more likely the single directory server is unavailable to reply to requests.
Security. Security should never be overlooked when designing any networked system, and it is especially important in peer-to-peer networks where both the volume of content and the number of network nodes can be quite large. Common security problems such as viruses, encryption cracking, bandwidth clogging, internal and external network attacks, eavesdropping, and so on are all concerns when designing such a system. Additionally, as the number of nodes on the peer-to-peer network increases, so does the system's overall vulnerability to security breaches. For our centralized application of peer-to-peer, I have decided to implement a basic example of Secure Sockets Layer (SSL) from the standard Java SDK.
Approach
At the heart of any discussion of efficient network design is overall topology, or how best to connect the nodes within a group. For a centralized system such as the one I considered, the topology is considered in terms of the information flow of the network as a whole. The nodes in the graph are the peers, and links (or edges) between peers indicate a regular sharing of information. For the network to be truly effective, the nodes should be able to use the edges to share information without unnecessarily loading the network.
How the network is designed determines how information is shared. Below are definitions of the most common network models in use today:
Centralized. The architecture considered in this project. Centralized client/server systems are currently the most popular form of network, with a central server adjudicating among its client peers. Examples include web servers, databases, SETI@Home, Napster, etc.
Ring. A common method for scaling centralized services is to use a cluster of machines arranged in a ring to act as a distributed server. Communication between servers coordinates the sharing of system state. This establishes a group of nodes that provides the same function as a single server but adds redundancy and load-balancing capabilities. Typically, ring systems consist of machines that are nearby on the network and owned by a single organization.
Hierarchical. DNS is an example of such a system. In the case of hierarchical systems, authority flows from the root name servers to the server for the registered name and downward. Usenet is another example of a large hierarchical system.
Decentralized. Popularly regarded as true peer-to-peer computing, decentralized systems communicate symmetrically, with each node taking on responsibility as both client and server. Popular examples are Gnutella and FreeNet.
Each of the above system architectures dramatically affects the overall success and usability of the network. In our centralized example, there are numerous environmental factors that can influence the overall stability and effectiveness of the system. For example, resources may suddenly become unavailable if a user decides to disconnect from the network or power off a machine. Of course, the volume of users and their volume of use is a key concern. There are also random events such as connectivity failures, hackers, viruses, etc. (albeit not a consideration in this project) that can influence the system's performance.
System Requirements
A good deal of design-time effort went into this project. Requirements analysis was made up of several sections, each defining a particular functionality of the system. This project followed these planning constructs: Analysis (the textual what, how, and why of planning for development), Use Case Diagrams (graphical descriptions of how the user interacts with the system), Sequence Diagrams (the particular step-by-step process of the system), and Class Diagrams (the classes and their methods). Other diagrams such as State, Activity, Collaboration, and Deployment were considered but deemed overkill and, in many cases, redundant.
Analysis
The analysis phase is used as a top-level exploration of the project as a whole. It follows a simple question and answer format and is intended to touch on each and every aspect of the system, from hardware and software requirements, to how users collaborate. Below is a piece of the initial Analysis phase Q and A.
- Question 1: What is the intended purpose of this project?
- To build a simple example of a peer-to-peer system. In this case, a central directory server will be used (similar to Napster's model).
- Question 2: What are the particular inputs of the system?
- The user should have a way of connecting to the central server and registering with the network. Additionally, he or she should be able to retrieve data about network nodes and the file contents. Accordingly, methods for searching for and retrieving files should be designed.
- Question 3: What software is required?
- For this project, the j2sdk1.4.2 was used in addition to SparxSystems UML tools. Additionally, RMI was used for remote networking. RMI is included in the Java development kit.
- Question 4: What hardware is required?
- For realistic client/server interaction, two computers would be nice but much can be done on a single computer.
- Question 5: How many users of the system will there be?
- For this system, a single directory server and three remote clients.
- Question 6: ...
Figure 3: An example of the question and answer exercises in the planning and documentation period. A number of key outputs of this exercise helped us get a better sense of what needed to be addressed in sequence diagrams and use cases.
Use Case Diagrams
Use case diagrams were designed to describe how the individual user interacts with the system. This exercise helped us satisfy questions raised in the analysis phase such as "what are the desired inputs?", "how is a file advertised?", and "what is being accomplished?" It became clear early on that one of the key issues was going to be how Content Delivery was going to occur. Approaches such as raw I/O streams over TCP/IP or UDP would have made the job easier but, since this is an RMI project, I had to rely on remote object calls. This issue is further explored in the Implementation section below.
Figure 4: User view of the system. This use-case describes the main inputs required of a user on the system. Later, these inputs will be translated into classes and methods.
Sequence Diagrams
At this stage, the overall architecture of the project started to take shape. When I initially considered design objectives, I decided to concentrate on issues relating to peer location, content location, content delivery and security. Several early architecture ideas were designed and eventually dropped. Accordingly, the main issue I initially struggled with was the choice of protocol. Sun Microsystems's JXTA seemed like a good option initially but, once we decided to move away from a decentralized architecture to a directory server, it was no longer relevant.
Pseudo code was designed at this stage for both the client and server implementations. The pseudo code underwent numerous revisions as the project progressed. Below is a version of the client implementation:
If (the directory server accepts a new connection)
register client with the server/network;
update the directory;
initialize the peer group;
else if (the server refuses to connect)
attempt to establish a connection n times;
else
fail;
if (there is input into the user interface)
search clients for files matching entered string;
else
download files from clients;
Figure 4: The pseudo code for how the client is intended to interact with the system. A number of such pseudo code examples were designed and updated - later informing sequence diagrams and class and method design.
Both the Use Case Diagrams and the pseudo code laid a good foundation for the development of the several sequence diagrams that were built to help us describe the steps that would be taken during the exchange between client and server. A good example of the flow of events can be seen in Figure 5.
Figure 5. Sequence flow of the central indexing architecture. The local host queries the remote directory server for the location of a given piece of content. The server replies with the IP location of the node containing the text file. The local host then handshakes with the remote host to receive the content.
Implementation
The work completed is outlined in the following sections. The implementation of the RMI client/server was informed by the use cases set up in the requirements. Classes and their associated attributes and procedures were designed based on the information gathered during the planning and requirements-gathering phase. A first draft of the Server and Client implementations was designed. Because RMI transparently accomplishes a good deal of the work involved in setting up a file sharing system, it made much of the design-time effort straightforward.
I make a few assumptions during implementation. I assume that only one peer is accessing the server at a time and that queries to remote methods are not happening in tandem. I also assume that the same is true for similar activity on the system such as file requests and downloads. Finally, I have designed a small network and assume the reader understands that the results will not be the same should the number of nodes or overall usage increase.
Classes and Methods
Classes and methods were designed during the implementation phase to transport method calls from the local GUI interface to the remote server. When I designed the methods in my client and server classes, they essentially followed the conventions below (a minimal code sketch follows the list):
- Derive an interface from java.rmi.Remote that contains the methods to be made available to RMI clients.
- Define a class that extends the appropriate subclass of java.rmi.server.RemoteServer. In our case, this class is UnicastRemoteObject.
- Implement the derived interface in the derived class.
- Use javac to create class files.
- Create stub and skeleton classes with the JDK rmic utility, and make the stub classes available to the client and servers.
- Start the RMI registry on the local machine.
- Start the main application, which should instantiate the RMI server class and register it with the local registry.
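To make these steps concrete, here is a minimal, self-contained sketch of the pattern; the FileIndex name and its single method are illustrative placeholders rather than the classes used in this project.

import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Step 1: an interface derived from java.rmi.Remote.
interface FileIndex extends Remote {
    String[] listPeers() throws RemoteException;
}

// Steps 2 and 3: extend UnicastRemoteObject and implement the interface.
class FileIndexImpl extends UnicastRemoteObject implements FileIndex {
    FileIndexImpl() throws RemoteException {
        super();
    }
    public String[] listPeers() throws RemoteException {
        return new String[] { "clientA", "clientB", "clientC" };
    }
}

// Steps 6 and 7: with "rmiregistry" already running (and, for the 1.4 SDK,
// stubs generated beforehand via "rmic FileIndexImpl"), bind the server
// object under a well-known name so clients can look it up.
public class IndexServer {
    public static void main(String[] args) throws Exception {
        Naming.rebind("rmi://localhost/FileIndex", new FileIndexImpl());
        System.out.println("FileIndex bound and ready");
    }
}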
Originally, the system was designed with a command-line interface but a GUI was added later on in the development process. On the server side, Start() and Stop() methods were designed for allowing clients to register and deregister with the server. As information flows to and from the server, an Update() method updates the directory of files.
Figure 6. Design of the system's core class diagrams.
The startPeers() method is provided as a callback from the server to the requesting client. The server calls this method back on the client, passing references to the other peers on the network. Once this happens, the receiving peer holds the node information in an array and searches it when necessary. The generic actionPerformed() method is used to handle all the inputs from the GUI; a sketch of this dispatch follows the list below:
Search Nodes on the Network. To do this, the client user enters the getPeers() string in the command line. This puts a call to the server to get a new set of peers for the client.
Search Files on the Network. To do this, the client calls the remote file search method on all peers in its peer array and puts the results in the list object of the graphical interface.
Download a Located File. To do this, the client user calls a getFile() method on the peer it needs to receive the file from. The results are written to a byte array on the local client.
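For illustration, a dispatcher along these lines could route the three inputs above; the widget and helper names (inputField, peersButton, searchButton, downloadButton, selectedIndex, fileBytes, searchPeer()) are hypothetical stand-ins rather than the project's actual GUI code.

// A fragment assumed to live inside the client implementation class;
// 'server', 'peers', and the GUI widgets are hypothetical instance fields.
public void actionPerformed(java.awt.event.ActionEvent e) {
    String entered = inputField.getText().trim();
    try {
        if (e.getSource() == peersButton) {
            // Refresh the local peer array from the directory server.
            server.givePeers(this);
        } else if (e.getSource() == searchButton) {
            // Ask every known peer whether it holds a matching file and
            // collect the hits for the GUI's list component.
            for (int i = 0; i < peers.length; i++) {
                if (peers[i] != null) {
                    searchPeer(peers[i], entered);
                }
            }
        } else if (e.getSource() == downloadButton) {
            // Pull the file's bytes from the providing peer, then flush
            // them to the local disk via writeFile().
            fileBytes = peers[selectedIndex].getFile(entered);
            writeFile(entered);
        }
    } catch (java.rmi.RemoteException ex) {
        ex.printStackTrace();
    }
}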
Additional methods are used to return basic information to the requesting client. They include writeFile(), which writes the contents of the byte array to a file, and getFile(), which is called by the remote client to return a byte array containing the contents of the requested file. In other words, if a client wants to download a file called A_Tale_of_Two_Cities.txt, it must call that method on the providing client.
The Client
After the classes were designed and revised, the next step was to implement the core code for the local client. The client design is declared in a public interface that extends remote objects. Accordingly, this interface extends java.rmi.Remote and its methods are declared to throw RemoteException.
As each new client joins and leaves the network, it calls the methods declared in the Client interface. Additionally, filenames and fileSizes arrays were added to store information about the files, and their sizes, located on the other clients on the network. The class takes the form:
package network;
import java.rmi.*;
import java.io.*;
public interface Client extends Remote {
    String[] filenames = new String[99];
    long[] fileSizes = new long[99];
    public abstract void initPeers(Client clientA, Client clientB, Client clientC)
        throws java.rmi.RemoteException;
    public abstract void getHost() throws java.rmi.RemoteException;
    public abstract void listFile(int i) throws java.rmi.RemoteException;
    public abstract void getNumFiles() throws java.rmi.RemoteException;
    public abstract void getIP(String searchString) throws java.rmi.RemoteException;
    public void writeFile(String filename) throws java.rmi.RemoteException;
    public byte[] getFile(String filename) throws java.rmi.RemoteException;
}
Figure 7. The public interface for the client.
The interface's purpose is to mark derived interfaces that contain methods to be exported by the remote RMI server. These method calls were designed to find out as much as possible about the surrounding network environment: most importantly, information about the location of other nodes, via getIP() and getHost(), and information about the files each node has in its local directory, via listFile() and getNumFiles(). Finally, the methods to download a file are getFile() and writeFile(). These last two methods, particularly the getFile() byte array, later proved to be somewhat problematic; I discuss the reasons and the solution further down.
Once the public Client interface was designed and tested, the main client implementation class - ClientImpl - was coded. This class contains the actual logic of the designed methods and procedures in addition to the relationship to the graphical user interface. The core functionality of this class is its ability to register each new client with the network and ultimately make its information available to the other clients. Other key algorithms involve the deregistering of the client and the method that updates the array of file names and sizes on the current listing. This is the updateDir() method.
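By way of illustration, updateDir() might look roughly like the following sketch, which assumes a sharedPath string naming the client's shared directory and a hypothetical numFiles counter; it is not the project's exact code.

// Refresh the local arrays of file names and sizes from the client's
// shared directory so that other peers can query them.
public void updateDir(String sharedPath) {
    java.io.File[] entries = new java.io.File(sharedPath).listFiles();
    int count = (entries == null) ? 0 : Math.min(entries.length, filenames.length);
    for (int i = 0; i < count; i++) {
        filenames[i] = entries[i].getName();   // advertised file name
        fileSizes[i] = entries[i].length();    // advertised size in bytes
    }
    numFiles = count;                          // hypothetical counter field
}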
The full ClientImpl file is too large to demonstrate here, but one method of note is the getFile() method, which reads the filename given in the public interface into an array of bytes. The contents of the file are read into a byte array in memory, and that byte array is later written to a new file on the local disk of the requesting node. The essence of this process is shown in the steps below:
{
    byte[] contents = new byte[0];
    try {
        File inputFile = new File(filename);
        long size = inputFile.length();
        contents = new byte[(int) size];
        // Read the entire file into the byte array, then release the stream.
        FileInputStream in = new FileInputStream(inputFile);
        in.read(contents);
        in.close();
    }
    catch (IOException e) {
        System.err.println("getFile failed: " + e);
    }
    return contents;
}
Figure 8. The core of the getFile() method, which reads the requested file into a byte array.
The Server
The remote extension of the server class is, by comparison, not as complex as the design of the client classes. Essentially, the server was designed to do three things: register and deregister clients, store information about client activity, and give information to requesting clients as needed. Accordingly, the Server class's methods are register(), deregister(), and givePeers().
In the implementation of the Server class, a Vector is used to keep information about client activity. When local clients query the server, the server searches this Vector to provide information about which clients are currently active on the network. The code for this functionality takes the form:
protected Vector clients;
public ServerImpl () throws RemoteException {
clients = new Vector ();
}
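To give a sense of how this Vector is used, the server's bookkeeping methods might be sketched roughly as follows; this is an illustration consistent with the description above, not the full ServerImpl source.

// Simplified sketch of ServerImpl's bookkeeping around the clients Vector.
public void register(Client c) throws RemoteException {
    if (!clients.contains(c)) {
        clients.addElement(c);      // remember the newly connected peer
    }
}

public void deregister(Client c) throws RemoteException {
    clients.removeElement(c);       // forget a peer that has logged off
}

public void givePeers(Client requester) throws RemoteException {
    // Hand the requesting client references to the connected peers via its
    // initPeers() callback (the startPeers()/getPeers() call described later).
    Client a = clients.size() > 0 ? (Client) clients.elementAt(0) : null;
    Client b = clients.size() > 1 ? (Client) clients.elementAt(1) : null;
    Client c = clients.size() > 2 ? (Client) clients.elementAt(2) : null;
    requester.initPeers(a, b, c);
}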
Additional efforts were made to determine how long operations were taking to execute on the server side. The key mechanism for the remote timing of events is a timer initialized at the local node that executes against the server and terminates once the result is returned from the remote server. The data from this operation helped me gather more information about how the network was behaving in different environments, such as increased traffic or the transmission of different file sizes. The server code that records the start and stop time of a remote call should be distinguished from the Timer package, which attempts to gauge the length of remote execution: the local Java code makes a call on the remote object being implemented and returns a date and time in association with the object being acted on. An example of one of the classes in this package takes the form:
import java.rmi.RemoteException;
import java.util.Date;

public class Timer implements Runnable {
    TimeMonitor tm;

    public Timer(TimeMonitor tm) {
        this.tm = tm;
    }

    public void start() {
        // Run the timing loop on its own thread.
        (new Thread(this)).start();
    }

    public void run() {
        while (true) {
            try {
                // Wake up every ten seconds.
                Thread.sleep(10000);
            }
            catch (InterruptedException x) {}
            if (this.tm != null) {
                try {
                    // Report the current time back through the remote callback.
                    this.tm.tellMeTheTime(new Date());
                }
                catch (RemoteException x) {}
            }
        }
    }
}
Figure 9. The above class example remotely queries the server and returns a time stamp.
The purpose of this code (in addition to its related classes) is to remotely start the execution of a timer on each new thread that is executing on the server. When the thread has completed its execution, the current system time is returned as a result. The end result is that I get a sense of how long the various methods take to execute on the server.
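The TimeMonitor callback that the Timer drives is not shown above; under the same design it would be a small remote interface along these lines (a sketch of the idea rather than the project's exact file):

package network;

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.Date;

// The remote callback invoked by Timer every ten seconds: the monitoring
// side implements this so it can receive the reported timestamps.
public interface TimeMonitor extends Remote {
    void tellMeTheTime(Date now) throws RemoteException;
}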
Security
Security in networking, and particularly in large peer-to-peer applications, is an important topic. Because this project is about the effective sharing of resources, there is a lot of potential to deliver harmful material to any of the nodes on the network. In larger networks this topic of research is crucial to the vitality of the system.
In my particular RMI client/server program, the intention was to decrease the flexibility enjoyed by local clients when invoking remote classes. Otherwise, any client program could run any server object, some of which could be potentially harmful to the network. When researching solutions provided by RMI, the answer I decided to use in my implementation was to install a security manager. Without the installation of a security manager there are no restrictions placed on how remote objects are accessed and by whom.
I used the RMISecurityManager class from the java.rmi package to instantiate the security manager with the statement below:
if (System.getSecurityManager() == null)
{
System.setSecurityManager(new RMISecurityManager());
}
In addition to the above lines of code, the Java SDK that I was using for this project required that a security policy file be specified at runtime. This is done by defining the java.security.policy property:
java -Djava.security.policy=mypolicy
In order to access remote objects on the system, Java looks for a system-wide policy file in its run-time library. It also looks for a local policy file in the home directory of each requesting client. A sample policy file that grants full access permissions to everyone looks like:
grant {
permission java.security.AllPermission;
};
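Granting AllPermission is convenient for testing but far broader than necessary; a more restrained policy, shown here purely as an illustration, would grant only the socket permissions RMI traffic actually needs:

grant {
    permission java.net.SocketPermission "*:1024-65535", "connect,accept";
};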
Through the use of its local policy file, each client on the network can grant permissions to each other node on the network. This exchange is made possible by the Permission classes in the java.security package, which provide access grants to specific resources.
Evaluation
Presented in this blog are the results of three experiments. In all cases, the intent is to learn something about the strengths and weaknesses of centralized search methods. To do this effectively, I use the timing classes in the Timer package, which gauge how long an action takes from the time it is initiated to the time a response is returned from the server. Data is collected from performance tests on each of the three areas of interest mentioned above: node discovery, content discovery, and content delivery. In each case, the load placed on the system consists of 50 simultaneous queries issued by each of the three clients.
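For context, a load of this sort can be generated with a simple loop on each client; the sketch below is only a rough illustration of the idea (the server stub and localClient names are hypothetical, and the project's actual measurements were taken with the Timer and TimeMonitor classes described earlier):

// Hypothetical load generator: fire 50 back-to-back peer queries at the
// directory server and record the coarse round-trip time for the batch.
long start = System.currentTimeMillis();
for (int i = 0; i < 50; i++) {
    server.givePeers(localClient);      // remote call under test
}
long elapsed = System.currentTimeMillis() - start;
System.out.println("50 queries took " + elapsed + " ms");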
System tests were completed over a period of one week. Tests and their resulting data were not gathered consecutively, as research has suggested that this can cause results to vary significantly. Later in the testing phase, results were compiled and explored.
Node Discovery
Of primary interest when implementing the Timer class is the location of additional peers on the network. As each new node comes onto the network, it registers with the server. The server's chief task is to keep track of all the nodes currently logged onto the system and give references to requesting peers. To keep information updated, any given node can, at any time, contact the server to request an updated list of information about the other nodes.
When a new client comes onto the network, it initializes the startPeers() method and remotely invokes the server's givePeers() method. Information about the other nodes on the network is then loaded into a local array. The initialization method on the client side takes the form:
public void getPeers (Client clientA, Client clientB, Client clientC) {
    clients[0] = clientA.getHost();
    clients[1] = clientB.getHost();
    clients[2] = clientC.getHost();
}
Figure 10. The startPeers() method on the client side invokes the remote givePeers() method on the server. A resulting list of connected nodes is delivered to the client.
It is the first objective of this project to study the effectiveness of this peer discovery interplay between client and server. Data is collected by timing the initialization of the getPeers() method, the remote invocation of the givePeers() method, and the final response from the server. The sequence of events takes the form:
$ getpeers
client sent time...
Time: Tues Aug 02 12:11:09 CST 2003
server received time ...
Time: Tues Aug 02 12:11:10 CST 2003
client returned time ...
Time: Tues Aug 02 12:11:11 CST 2003
Figure 11. Functioning of the timing sequences. Commands issued on the local client, the execution of remote methods, and returned results are all time stamped to issue data points for further exploration.
As Figures 8 and 9 indicate, I tested the getPeers() method call in an environment where only one client was querying the server and then in an environment where three clients were simultaneously querying the server. In the case of Figure 7, the overall average of the combined data points for each client was 1.562. In Figure 8, the single client's average was 1.262. While it wasn't surprising that the increased load in Figure 7 resulted in a larger overall average, I did expect the numbers to be further apart.
A second observation was in the difference in fluidity between the two figures. In Figure 7, a somewhat erratic behavior is observed that is not so (or not at all) present in its counterpart. I am guessing that this observation can possibly be attributed to the three clients competing for the same method invocations on the remote server. This touches on the issues inherent in concurrent systems programming mentioned earlier. For me, the question that Figure 7 raises is whether or not RMI is thread-safe. There are a lot of possibilities here. One such possibility is that the connections are being pooled in such a way that only one is being used by an outstanding remote call at a time. Just because the stub never modifies any instance data does not mean that concurrent calls writing to the same socket will marshal correctly. Another possible explanation is the actual activation of the remote objects. In Sun's documentation it was unclear to me how to tell whether a remote object is in an active or passive state when being accessed. Without clarity here, it is possible that the graph below reflects multiple threads trying to spawn multiple processes for the same activation group - in this case the givePeers() method.
Content Discovery
The way in which content is stored and advertised on a network can dramatically influence the effectiveness of its associated search methods. The advantage of a system with a centralized directory is that it is possible to quickly gain access to information about which nodes contain which files. Systems such as KaZaA use this fact to their advantage by combining a fast directory lookup node, or supernode, with the propagation power of decentralized systems.
Fast access to content references initially happens when the client registers with the remote server. As each new node registers and deregisters with the remote server, the updateDir() method updates the array of file names and sizes with the current information. The local client then stores that information as an array in its local directory:
String[] filenames = new String[99];   // stores file list
long[] fileSizes = new long[99];       // stores file sizes
The metrics I used to explore the content location qualities of centralized systems are outlined in the below charts. Two approaches were designed. In the first test, each peer simultaneously searches for a different file on the network. In the second approach, each peer is simultaneously searching for the same file on the network. The focus of these two different tests is, in general, to gauge the overall time it takes to locate a file on the network and how well file location performs under increased load conditions.
As observed in Figures 13 and 14, there is no significant difference between multiple clients accessing the same file and multiple files being accessed by multiple clients. I did expect to see some variation in the results. There are, however, some notable spikes in Figure 9's activity. Whether or not I can attribute these small increases to issues such as "thread safety" or the remote activation of objects is hard to say, although I doubt it; it is more likely that they are due to slight variations in the results. As a side note, it is important to point out that, of the files searched, the majority of the "desirable" files (likely the ones I was querying the most often) were located on a single client. In terms of wait time resulting from competition for resources, this observation (on a small scale, anyway) doesn't seem to have much effect on the system's activity.
A second test was conducted to rank the expected results from content searches based on the popularity of the content. Unfortunately, I don't have the luxury of a large network used by a diverse group of users with varied interests to get a truly random sampling of how popularity may influence usage of the network. As a next best solution, I gave each file on the network a "popularity ranking". This was done by assigning values from 1 to 10 (10 being the highest) to each of the ten files on each of the three clients being tested. As the remote method invocations request files at random, the final results should give us an idea of where traffic might be directed within the network. The results of this test are discussed under Observing Results below.
Content Delivery
This section of my project explores how content delivery behaves in a centralized network environment; namely, I evaluate how the flow of information happens from one node to the next. To get a sense of how efficient this type of information exchange is in our small network, the two tests I used evaluated 1) multiple file downloads (of various small files under 2 MB in size) happening at the same time and 2) multiple downloads of a single large file (20 MB in size).
Once a file was located on the network, the actual transmission from one node to the next proved to be a bit more complicated than I anticipated. After researching solutions such as TCP/IP and I/O streams, it seemed that the best method for RMI file transfer was to read the file's contents into an array of bytes on the remote client. To do this, I followed this sequence of events:
- Instantiate the remote object
- Open the file and get its size
- Allocate the byte array and read the file into that array.
- Copy the file name.
Once these steps were accomplished, the remote file object could be transferred by calling the getFile method on the remote client. This method call fills the local client's array of bytes with the bytes from the remote file. Then, the local client calls the writeFile method, which writes the contents of the byte array to a file.
As observed in Figure 12, initial large-file transfer tests yielded poor results. In most cases, the transfer of a large file proved too memory intensive and the system simply hung. After exploring this problem, I discovered that this wasn't a shortcoming of the centralized architecture design but of how the writeFile method call was writing bytes into the local array. The (short-term) remedy was to cut the file into 5 MB byte arrays and transfer it as a sequential series of chunks, reassembling the file on the receiving end.
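A rough sketch of that chunked approach is shown below; it assumes a hypothetical getFileSize() helper and a three-argument overload of getFile() taking an offset and length, neither of which appears in the interface shown earlier.

// Hypothetical chunked download: pull the remote file in 5 MB slices and
// append each slice to the local copy as it arrives.
void downloadInChunks(Client remotePeer, String filename)
        throws java.rmi.RemoteException, java.io.IOException {
    final int CHUNK = 5 * 1024 * 1024;
    long size = remotePeer.getFileSize(filename);              // hypothetical helper
    java.io.FileOutputStream out = new java.io.FileOutputStream(filename);
    for (long offset = 0; offset < size; offset += CHUNK) {
        int len = (int) Math.min(CHUNK, size - offset);
        out.write(remotePeer.getFile(filename, offset, len));  // hypothetical overload
    }
    out.close();
}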
Observing Results
The network characteristics of centralized systems were studied with peer location, content location, and file download effectiveness in mind. In our first test, I tested a single client's ability to invoke remote method calls on the server to get a listing of connected peers on the network. The metric used is the wait time between the execution of the command and the returned results. In the first case, I found that the average wait time is roughly 1.562 in an environment where load is being placed on the server. In the case where one node is accessing the remote server to get peer information, the wait time is comparatively less - 1.262 - as might be expected.
The second exploration examined a local client's ability to quickly find the location of files on the network. In both cases - multiple nodes querying the same and different files - the response time was immediate. As an added observation, it is interesting to note that the percentage of responses to requests remained at a high and predictable level; at no time was a request for a file rejected by the remote server. A second test was added on the subject of content location: popularity. The popularity of a particular file (or group of files on a particular node) can dramatically influence how activity is distributed. In our test, a sampling taken at random shows that the overall load of the network was weighted towards Client A. In such a case, especially if the popularity of files is proportionally small on the surrounding peers, a bottleneck could occur. As the weighted results for Client A suggest, an effective way of evenly distributing how and when remote objects are invoked by connected clients is an important consideration when dealing with a centralized system.
Finally, the effect a file retrieval has on the system was tested. In both cases, all three peers in the network were transferring a relatively small file, under 2 MB, and then a larger file of 20 MB.
Conclusion
I have spent a good deal of time testing the various strengths of searching a centralized network and have found that such a network can be both powerful and powerless depending on what you demand of it. Centralized directory servers are a very powerful tool for providing fast references to remote locations on the network, which is certainly valuable in large, multi-node environments where multiple files are being shared. The data points in the peer discovery and content discovery sections certainly back up this finding. On the other hand, I found content delivery to be problematic for the reasons stated above. I don't feel this is the result of a centralized environment but rather of the shortcomings of remote method invocation (RMI). Certainly, my earlier fix of cutting files into segments of byte streams could be replaced with sockets or some other such solution, but the issue of propagation makes RMI a poor choice for large-scale file-sharing environments, centralized or decentralized.
There are a number of improvements I would likely make to this project, given more time. As mentioned earlier, one of the primary problems of centralized systems is that they are not as efficient in propagating requests throughout the network as their decentralized counterparts. The problem is that a remote peer cannot send unrequested data to a client doing a search; it can only send data when explicitly asked for it by the requesting client. This inflexibility of the centralized RMI approach makes it quite impossible to seamlessly share information throughout the network. To illustrate the point, consider Client A. When Client A wants a file, it makes a call to the remote server, requesting information about the other connected nodes. If Client B has the requested file, then Client A makes a direct connection. But what if the file does not exist within Client A's peer group, but is available somewhere else on the network? In an ideal situation, Client B would be able to refer the requesting client to another node on the network that does have the file. This is the essence of the JXTA protocol. JXTA uses the notion of advertisements and a peer discovery protocol (PDP) to fluidly locate references to information throughout the network. Advertisements are essentially messages represented as XML that make available information stored in a given peer's cache, such as other peers, peer groups, or available local or remote content. When a peer attempts to discover a particular piece of content, it searches the referring advertisements until a reference to the correct node is found. The efficient propagation methods of the JXTA protocol are not possible in an RMI environment where information exchange is a one-to-one dynamic.
A second improvement would address the problem of concurrency within distributed systems. The way RMI currently works, a method dispatched by the RMI runtime to a remote object implementation may or may not execute in a separate thread; the RMI runtime makes no guarantees with respect to mapping remote object invocations to threads. As a result, when an RMI server is written, any synchronization needed to keep the remote object implementation thread-safe must be hand coded. This introduces a degree of complexity that, although I did not have time to address in this project, is crucial to any system that entertains the possibility of numerous simultaneous client requests (such as a multi-user file-sharing system).
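As a minimal illustration (not something implemented in this project), the simplest hand-coded guard is to synchronize any server method that touches shared state such as the clients Vector:

// Illustrative only: because the RMI runtime may dispatch calls on arbitrary
// threads, serialize access to the shared client list by hand.
public synchronized void register(Client c) throws java.rmi.RemoteException {
    if (!clients.contains(c)) {
        clients.addElement(c);
    }
}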
Another improvement I would like to add, time allowing, would be the inclusion of additional tests for each of the three subject groups. I feel that additional variations could be made on the system load testing. For example, in the case of the content discovery tests, it might be interesting to explore how the system behaves when content is evenly distributed throughout the network versus unevenly located on only a few nodes. Another such improvement might be a decent stab at building a system that manages to propagate requests from node to node. Given the focus of this project, I was not able to invest much effort in finding a good solution to this problem. Nonetheless, propagation is the shortcoming of centralized systems (and the advantage of their decentralized kin), and the design of an RMI system that intelligently tackles this problem would certainly be interesting. Finally, it would be useful to determine how file search, discovery, and retrieval behave as the size of the system scales. While a three-node network is useful for the purposes of this blog, a realistic performance test would require a more robust architecture.