Author: Oleg Lodygensky
XtremWeb-HEP 7.5.0 deployed at LRI
XtremWeb-7.5.0 released
Previous versions
- Sept 1st, 2010 : XWHEP 6.0.2
- Corrections
- a bug corrected in the xwtar.sh script, which generates the sources archive; this archive is now correctly generated.
- July 19th, 2010 : XWHEP 6.0.1
- Corrections
- this version introduces the scripts that check and update the database, which were missing in 6.0.0
- July 8th, 2010 : XWHEP 6.0.0
- Corrections
- a bug corrected on server side concerning data download : if a data file is not available on server side, distributed parts are now cleanly informed
- a bug corrected on server side concerning work alive signal (on finished tasks)
- a bug corrected in install packages concerning “adduser” usage
- xwversion returns 1 if client should be upgraded
- a bug found in data management (WIN32 worker)
- xwconfigure introduces the --newkeystore and --newalias command line parameters. By default, keystores are created if missing and left unmodified if they already exist.
- --newkeystore can be used to force keystore regeneration (even if keystores already exist). Doing so invalidates any deployed platform : clients and workers must be redeployed
- --newalias can be used to insert a new alias in existing keystores.
This helps to keep a deployment alive before keystores expire (this can then be used in conjunction with the keystore.uri variable in xtremweb.server.conf)
- New features
- client RPM and Debian packages
- Host definition contains new columns
- -osversion
- -javaversion
- -javadatamodel
- App definition contains necessary columns to manage x86_64 for Win32
- Use Java 6 SystemTray
- June 30th, 2010 : XWHEP 5.9.1
- Corrections
- a bug corrected in SQL statements
- April 28th, 2010 : XWHEP 5.9.0
- Corrections
- Mac OS X installers updated : they are now compliant with all versions (including 10.6)
- XML (un)marshalling from/to stream cleaned : performance increased by more than 20% (see below)
- Removed scripts : xwhepgenserverkey, xwhepgenworkerkey, xwhepgenclientkey. Replaced by a single script : xwhepgenkeys. The latter is generated by xwconfigure
- New features
The platform is now able to update the server public key. This works if and only if the server key has not expired yet.
- Purpose : keep a deployment alive even if the server key expires
- What : distribute the new server public key (automatically) to workers and (on demand) to clients
- When: before the actual server key expires
- How :
- use the script xwhepgenkeys to add new key to keystores
- insert worker keystore (containing both actual and new server public keys) as public data in XWHEP server
- stop the server
- set keystore.uri variable in xtremweb.server.conf
- restart the server
- Security : This does not break security since
- workers and clients must have the current server public key to connect to server
- updated keystore is then distributed through encrypted communication channel
- workers and clients must have valid credentials to connect to server
- workers and clients safely keep keystores
- the distributed keystore contains server public keys only
XML un/marshalling has been improved. This results in a 20% performance increase, as shown in the next two figures, which represent the client execution time to retrieve 1000 jobs from the server of our XWHEP deployment at LAL. In the first figure we can see that the client needed more than 100 seconds using XWHEP 5.8.0; in the second figure, less than 80 seconds is needed using XWHEP 5.9.0.
- Mar 5th, 2010 : XWHEP 5.8.2
- Corrections
- a bug corrected in init scripts
- Feb 26th, 2010 : XWHEP 5.8.1
- Corrections
- the script xwhep.bridge.pl has been corrected and is now compatible with 5.8.x versions
- Feb 24th, 2010 : XWHEP 5.8.0
- Corrections
- the script xtremweb.tests.pl has been corrected
- worker : a bug corrected on result uploading
- server : the scheduler can be defined in conf file and dynamically loaded at run time
- the scheduler has been totally rewritten : it was not only buggy and conceptually erroneous, but also unfair. In this version, the scheduler :
- does not cache DB rows any more, since this was a source of memory leaks
- improves SQL request usage so that it retrieves expected rows only
- is now entirely responsible for task management
- retreive() retrieves WAITING tasks from DB and sets them to PENDING
- select() retrieves a PENDING task on worker request
- but still does not fairly manage jobs among users
- if you delete a user, you can reuse its login when inserting a new user
- if you delete a user group, you can reuse its label when inserting a new group
- user passwords are never sent over the network
- X86_64 processors are correctly managed for Apple workers
- memory leaks resolved on server side. Next figures show server memory consumption. The first one shows the memory-starving behaviour of version 5.7.7 : after two days, the amount of allocated memory is as high as 600 MB. The second one shows a more stable consumption with version 5.8.0 : after two days, the amount has remained stable at only 288 MB.
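The split of responsibilities between the two scheduler calls listed above can be pictured with a small in-memory model. This is only a sketch under assumptions: the real scheduler works against the database, and the `Task` class, the list used as a stand-in for the DB, the `RUNNING` state and the function signatures are all hypothetical (the first function is spelled `retrieve` here).

```python
from dataclasses import dataclass
from typing import List, Optional

WAITING, PENDING = "WAITING", "PENDING"
RUNNING = "RUNNING"  # assumption: the text does not name the state after PENDING


@dataclass
class Task:
    uid: int
    status: str = WAITING


def retrieve(db: List[Task]) -> List[Task]:
    """Fetch WAITING tasks only (no row caching) and set them to PENDING."""
    fetched = [task for task in db if task.status == WAITING]
    for task in fetched:
        task.status = PENDING
    return fetched


def select(db: List[Task]) -> Optional[Task]:
    """On a worker request, hand out one PENDING task, or None if there is none."""
    for task in db:
        if task.status == PENDING:
            task.status = RUNNING
            return task
    return None
```

Note that `select` takes tasks in storage order, which mirrors the remark above that jobs are still not shared fairly among users.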
- New features
Some new features are implemented to improve confinement and access security.
- user rights slightly modified : ADVANCED_USER allows user group management. Consequence : for each user group, we must define a user group administrator who has enough rights to manage its own group (users, applications)
- Any object can now be confined : all objects have OWNERUID and ACCESSRIGHTS (AR) fields
- xwchmod now applies to any object
Major consequences :
- administrator and worker identities are private (AR : 0x700) : they are not listed by standard users
- we can confine a user group by setting its AR to 0x750
- if a group is confined, an administrator of the group MUST be inserted
- any user inserted in a user group inherits AR from its group
- users MUST be inserted in a group by the group administrator, if any (the user owner MUST be the group administrator, if any)
- users in a confined group are listable by user group members only
- the client can help to insert a user group and its administrator with:
$> xwsendusergroup MyGroupLabel theGroupAdminLogin theGroupAdminPassword theGroupAdminEmail
If a user group is not confined, everything works as in previous versions.
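To make the AR values quoted above (0x700, 0x750) easier to read, here is a minimal sketch that decodes them as one hexadecimal digit each for owner, group and others. The Unix-like meaning of the bits (read = 4, write = 2, execute = 1) and the `can_list` helper are assumptions for illustration; the server's actual checks may differ.

```python
READ, WRITE, EXECUTE = 0x4, 0x2, 0x1  # assumption: Unix-like bit meaning


def nibbles(access_rights):
    """Split an AR value such as 0x750 into (user, group, other) digits."""
    return (access_rights >> 8) & 0xF, (access_rights >> 4) & 0xF, access_rights & 0xF


def can_list(access_rights, owner_uid, owner_group, requester_uid, requester_group):
    """Hypothetical check: may the requester list an object with these AR?"""
    user, group, other = nibbles(access_rights)
    if requester_uid == owner_uid:
        return bool(user & READ)
    if owner_group is not None and requester_group == owner_group:
        return bool(group & READ)
    return bool(other & READ)
```

With this reading, an object at 0x700 is visible to its owner only, and an object at 0x750 is visible to its owner and to members of the owner's group, which matches the confinement rules listed above.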
- This version enables versioning. Here is how. All messages (requests, answers) are now encapsulated in an XML root element
...
so that the version is included in the protocol.
On top of that, some fields have been deleted, added or renamed
(see -B.2- and -B.3-)
All objects now have OWNERUID and ACCESSRIGHTS fields.
Some objects had CLIENTUID or USERUID; these have been renamed OWNERUID.
But, of course, versions prior to 5.8.0 know nothing about that XML root element, nor about the column changes. This current version takes care of all that. If a distributed part sends a message without the XML root element, this clearly means that this part runs a version prior to 5.8.0.
Hence we don’t encapsulate the answer in the XML root element
and we take care to answer with the expected attributes.
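The backward-compatibility rule above (answer a peer in the form it used for its request) can be sketched as follows. The root element name `envelope`, its `version` attribute and the payload element names are hypothetical, since the original elides the actual element; only the decision logic is taken from the text.

```python
import xml.etree.ElementTree as ET

ROOT_TAG = "envelope"  # hypothetical name: the real root element is not given above


def peer_version(message):
    """Return the version announced by the peer, or None for a pre-5.8.0 peer."""
    root = ET.fromstring(message)
    if root.tag == ROOT_TAG:
        return root.get("version")
    return None  # no root element: the peer runs a version prior to 5.8.0


def wrap_answer(payload, request, server_version="5.8.0"):
    """Encapsulate the answer only if the request itself was encapsulated."""
    if peer_version(request) is None:
        return payload  # legacy peer: answer without the root element
    return '<%s version="%s">%s</%s>' % (ROOT_TAG, server_version, payload, ROOT_TAG)
```

A complete implementation would also map renamed attributes (OWNERUID back to CLIENTUID or USERUID) for legacy peers; that step is left out of this sketch.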
- December 20th, 2009 : XWHEP 5.7.7
- Corrections
- workpoolsize is forced to 1 because we encountered some concurrent file access exceptions with workpoolsize > 1
- a bug corrected in Executor streams
- a bug corrected on aborted tasks management
- New features
- -NONE-
- December 1st, 2009 : XWHEP 5.7.6
- Corrections
- communication TCP layer : a bug resolved on reconnection error
- XWHEP 5.7.5 introduced a limit on the number of cache entries (set to 10K entries), but the implementation was buggy : this is corrected. The cache now evicts entries following a least recently used (LRU) policy
- server : detection of lost tasks is no longer done every 15s (which is absolutely not necessary), but every ALIVETIMEOUT only (default : 5 min). This greatly reduces CPU consumption.
- a bug corrected on accesslogger : it now opens a single file (and not one per thread)
- xtremweb.gmond.pl scalability improved
- certificates have no validity limit until further notice (this helps deployment)
- xwconfigure corrected : config files generation corrected
- linux installers corrected
- scripts corrected : “type” usage improved
- the scheduler refreshes its task list with PENDING ones only
- New features
- -NONE-
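The least recently used (LRU) cache policy mentioned in the 5.7.6 corrections can be illustrated with a short sketch. This is not the XWHEP cache implementation (which is written in Java); the class and method names are assumptions, and only the 10K limit comes from the text.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used entry first."""

    def __init__(self, max_entries=10_000):  # 10K entries, as quoted above
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently used entry

    def __len__(self):
        return len(self._entries)
```

Because every `get` and `put` moves the entry to the end of the ordered dictionary, the entry at the front is always the one that has gone unused the longest.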
- October 15th, 2009 : XWHEP 5.7.5
- Corrections
- Launcher improved : it does not start if it has a wrong SSL key, or if the URL is malformed
- Launcher corrected : error on library path
- worker improved : error messages are more detailed
- worker corrected : on network failure when uploading a result, results were lost and the worker was unable to request any new job. The worker now correctly recovers and retries to upload results as well as to get a new job.
- xwconfigure path bugs corrected
- xwhep.bridge.pl modified: it can now connect to the DB in order to gain performances
- xtremweb.gmond.pl : MySQL connections management corrected
- a bug resolved on server side regarding MySQL connections usage, which led to too many TIME_WAIT sockets
- a bug resolved on I/O library usage : all handlers are now correctly closed on exceptions/errors
- a bug resolved in client cache : the cache can no longer exceed 10K entries
- scripts improved : it is preferable to use ‘type’ and not ‘which’ in bash/sh scripts
- on client side, xwchmod now accepts UID too (and not only URI)
- build corrected : it now correctly generates x86_64 worker
- we don’t use String.intern() any more because we had some “OutOfMemory in PermGen” exceptions, on server side (see http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html)
- admin and worker are not inserted in any group (because this is not used and may be confusing)
- New features
- none
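The worker recovery described in the 5.7.5 corrections (a failed result upload must not lose the result) amounts to a keep-and-retry loop. The sketch below is an assumption-laden illustration: `upload` stands for whatever call sends the result, and the attempt count and delay are arbitrary values, not XWHEP settings.

```python
import time


def upload_with_retry(upload, result, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Upload a result, keeping it and retrying on network failure.

    Returns the number of attempts that were needed; re-raises the last
    error if every attempt failed, so the caller still holds the result.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            upload(result)
            return attempt
        except OSError:  # includes ConnectionError and socket timeouts
            if attempt == max_attempts:
                raise
            sleep(delay)
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting.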
- July 31st, 2009 : XWHEP 5.7.4
- Corrections
- host registration improved : the worker version is now stored too
- a bug corrected in the scheduler : it did not run over all tasks, but stopped at the first non compliant one
- xtremweb.ganglia bugs resolved :
- forked processes now correctly exit;
- use isdeleted flag from DB tables
- xwhep.bridge.pl corrected and slightly modified : it is no longer an error if a pilot job doesn’t process anything (see XWHEP-5.7.2, point 8 below)
- bugs corrected for the Win32 worker regarding path and java -Xrs option
(see http://java.sun.com/j2se/1.4.2/docs/tooldocs/windows/java.html)
- a bug corrected in the launcher : it must not delete JAR files
- New features
- some SQL scripts added to check the platform through the DB
- all needed scripts to update DB from previous versions
- July 9th, 2009 : XWHEP 5.7.3
- Corrections
- a bug corrected in the database definition
- hosts.ipaddr and hosts.natedipaddr length is now 50 to comply with IPv6 (20 was not enough)
- we introduce a new table “Version” which contains XWHEP version
- we introduce pilotJob field in hosts table to ease monitoring
- the xwsetupdb.sql script alters tables accordingly
- the xwconfigure script makes these modifications non destructively
- bug resolved on dates stored in DB (insertion date, completion date etc.)
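The column length change for IP addresses mentioned above can be checked with a little arithmetic: a fully written IPv6 address takes 39 characters, and the mixed form with a dotted IPv4 suffix takes 45. The sketch below only measures text lengths; the two example addresses are arbitrary and the `fits` helper is hypothetical.

```python
import ipaddress

OLD_LEN, NEW_LEN = 20, 50  # column lengths quoted in the text

# A fully expanded IPv6 address: 8 groups of 4 hex digits plus 7 colons.
full = ipaddress.IPv6Address("2001:db8::ff00:42:8329").exploded

# The longest textual form: 6 hex groups followed by a dotted IPv4 address.
mixed = "ffff:ffff:ffff:ffff:ffff:ffff:255.255.255.255"
ipaddress.IPv6Address(mixed)  # raises ValueError if this were not a valid address


def fits(address, column_length):
    """True if the textual address can be stored in a column of that length."""
    return len(address) <= column_length
```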
- June 17th, 2009 : XWHEP 5.7.2
- Corrections
- a bug corrected in the bridge
- Linux RPM and DEBIAN, and Mac OS PKG now log in /var/log
- configuration variables are now all trimmed (leading and trailing whitespaces are removed)
- configuration variable KeyStore was not correctly initialized. KeyStore can now correctly be read either from the config file or from the java parameter -Djavax.net.ssl.keyStore
- a bug corrected in MySQL error handling on XWHEP server side
- the usage of job group has been corrected
- a bug corrected on the scheduler
- X509 proxy certificate usage modified according to the EDGeS meeting (Jun 2009).
The client now automatically checks whether the $X509_USER_PROXY environment variable is set, so that the XWHEP “--xwcert” option is not required any more.
If using “--xwcert”, the certificate must be valid, otherwise the job is not sent.
If the $X509_USER_PROXY certificate is not valid, the job is sent without any certificate.
Any worker can download a job that has been submitted with an X509 proxy (and not only Pilot Jobs).
Pilot Jobs are private workers: they can run any job of their owner.
X509 proxies are never downloaded by workers (even Pilot Jobs).
The bridge may download X509 proxies (and the bridge only).
- A bug corrected on communications handler to integrate new protocol (here ADICS)
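The client-side rules for X509 proxies stated in the 5.7.2 corrections (explicit option versus $X509_USER_PROXY) form a small decision table, sketched below. The function name, its return convention and the `is_valid` callable are assumptions made for illustration; real certificate validation is not shown.

```python
import os


def choose_certificate(xwcert=None, is_valid=lambda path: True, environ=None):
    """Return (send_job, certificate_path) following the rules in the text.

    xwcert   -- path given explicitly on the command line, or None
    is_valid -- hypothetical callable that checks a proxy certificate
    environ  -- mapping used instead of os.environ (handy for testing)
    """
    environ = os.environ if environ is None else environ
    if xwcert is not None:
        # Explicit option: an invalid certificate blocks the submission.
        return (True, xwcert) if is_valid(xwcert) else (False, None)
    proxy = environ.get("X509_USER_PROXY")
    if proxy and is_valid(proxy):
        return (True, proxy)
    # No usable proxy in the environment: send the job without any certificate.
    return (True, None)
```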
- New feature
- XWHEP now integrates ADICS protocol by Cardiff University.
See http://www.p2p-adics.org.
This is an “external” new feature :
- new protocol integration has existed in XWHEP for a long time
- we then stay in the 5.7.x branch
- May 25th, 2009 : XWHEP 5.7.1
- Note : this version is numbered 5.7.1 and not 5.6.4 because 5.6.3 introduced a new feature and should have been numbered 5.7.0. I correct this only now.
- Corrections
- xwconfigure now loops on each variable until a good value is provided
- Tar file creation error corrected : *ps files were excluded from the tar file, which was an error because xwapps, xwgroups etc. were then excluded too
- Mac OS X worker scripts bug solved
- Mac OS X installers errors solved. These were due to the fact that NetInfo is no longer supported as of Mac OS 10.5. Directory Service tools now replace NetInfo ones in our scripts. This is 10.4 and 10.5 compliant. Tested on 10.4.11 and 10.5.7.
- Win32 path management bug solved
- Win32 innoSetup project errors solved
- May 7th, 2009 : XWHEP 5.6.3
- Corrections
- To avoid confusion between XtremWeb and XWHEP, the latter now installs everything in /opt/XWHEP-server, /opt/XWHEP-worker and /opt/XWHEP-bridge. XWHEP packages (Debian, RedHat, Apple) install XWHEP-something packages and not xtremweb-something packages. All previously existing files (scripts, configuration files etc.) keep their “old” names (xtremweb.something) so as not to disturb those who have already developed over XWHEP.
- default socket timeout set to 0
- HTTPLauncher, which tries to find the latest JAR version and starts the worker, has been corrected :
- it now uses the latest JAR on the local FS;
- it now stops immediately if it can’t write the downloaded JAR to the local FS;
- it has been reported that the “-server” java option is not available on all JVMs; HTTPLauncher therefore first tries to launch the worker with that option and, on error, retries without it.
- On server side, inter-thread deadlocks corrected on the communication layer; they led to “unreachable” or “timed out” communication errors on the worker and client sides.
- A bug corrected on server side : it was impossible to reuse an application name, even if the application was deleted
- The worker now stores its own UID to its config file so that it does not appear several times in server DB
- installer corrected (RPM, DPKG, Apple PKG) because FS tree must belong to worker so that it can write downloaded JAR, if any
- New feature
- The worker can now manage dynamically linked applications (and not only statically linked ones)
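The HTTPLauncher fallback described in the 5.6.3 corrections (try the “-server” JVM option first, retry without it on error) can be sketched as below. This is not the launcher's code: the function name, the injectable `run` callable and the use of the exit status as the error signal are assumptions.

```python
import subprocess


def launch_worker(jar, run=subprocess.call, java="java"):
    """Start the worker JAR with -server first; on error retry without it.

    Returns the command line that was finally used.
    """
    with_server = [java, "-server", "-jar", jar]
    without_server = [java, "-jar", jar]
    try:
        if run(with_server) == 0:
            return with_server
    except OSError:
        pass  # the JVM could not be started with that command line
    run(without_server)
    return without_server
```

Treating any non-zero exit status as "option not supported" is a simplification; a real launcher would need to tell that case apart from a worker that started and later failed.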
- Apr 6th, 2009 : XWHEP 5.6.2
- a bug corrected on communication layer regarding URI definitions, which led to connection problems.
- Mar 12, 2009 : XWHEP 5.6.1
- the binary distribution has been corrected and augmented; it prepares RPM, Debian and Apple packages for the server, worker and client;
- a bug corrected on client side regarding job data.
- Feb 16th 2009 : XWHEP 5.6.0
- this version reintroduces the launcher to help deployment and upgrades;
- this version introduces "binary package".
- Feb 4th 2009 : XWHEP 5.5.0
- a memory leak bug solved on server side.
Next figures show server memory consumption. The first one shows the memory-starving behaviour of version 5.4.0 : after four days, the amount of allocated memory is as high as 382 MB. The second one shows a more stable consumption with version 5.5.0 : after six days, the amount is a third of the 5.4.0 memory usage.
- Jan 14th 2009 : XWHEP 5.4.0
- a bug corrected on I/O layer
- Dec 16th 2008 : XWHEP 5.3.0
- on server side, event dates are now correct (submission date, execution date, completion date etc.)
- a bug corrected on the bridge.
- Dec 4th, 2008 : XWHEP 5.2.0
- a bug corrected on server side, on communication handling : handlers do not hang anymore.
- Dec 1st, 2008 : XWHEP 5.1.0
- the EDGeS XWHEP to EGEE bridge now periodically sends a signal to XWHEP servers to facilitate global monitoring;
- a new script “xtremweb.jra2” has been implemented to monitor known XWHEP servers; this script has specifically been developed for EDGeS JRA2 activity;
- a bug corrected on server side, on communication handling; it seems that synchronized methods management is JVM implementation dependent;
- client scripts cleaned;
- worker configuration page can be used to manage a local pool of workers.
- November 21st, 2008 :
- EDGeS bridge from XWHEP to EGEE allowing EGEE resource usage is operational; it is monitored from here.
- November 17th, 2008 : XWHEP 5.0.0
- server uses a connection pool to avoid memory starvation. Server can manage up to 500 simultaneous connections. Above the pool size, incoming connections wait for an available handler;
- client has a new option --xwxml to provide an XML description file;
- communication layer has been simplified : there is only one send message for all object kinds;
- EGEE bridge has been stabilized.
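The connection pool behaviour described for 5.0.0 (a fixed number of handlers, extra connections waiting their turn) can be illustrated with a thread pool. This is a sketch, not the server's Java implementation; the class and attribute names are assumptions and only the size of 500 comes from the text.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ConnectionPool:
    """Fixed-size handler pool: extra connections queue until a handler is free."""

    def __init__(self, size=500):  # 500 simultaneous connections, as quoted above
        self._executor = ThreadPoolExecutor(max_workers=size)
        self._lock = threading.Lock()
        self.active = 0  # connections being handled right now
        self.peak = 0    # highest number handled at the same time

    def submit(self, handler, connection):
        """Queue a connection; it waits until a handler thread is available."""
        return self._executor.submit(self._run, handler, connection)

    def _run(self, handler, connection):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        try:
            return handler(connection)
        finally:
            with self._lock:
                self.active -= 1

    def shutdown(self):
        self._executor.shutdown(wait=True)
```

Bounding the number of handler threads is what keeps memory use flat under load: pending connections cost a queue entry rather than a thread each.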
- October 16th, 2008 : XWHEP 4.1.0
- a bug corrected on worker side regarding job directory setup
- October 15th, 2008 : XWHEP 4.0.0
- client can connect to different servers :
- client does not include any passphrase and does not code passwords;
- passwords are not coded any more in config file, nor in database;
- this does not introduce a security hole since communications are encrypted; it is the user’s responsibility to ensure config file security;
- this lightens compilation which does not require SQL access any more.
- a bug has been corrected on worker side, regarding data download;
- notions of groups and sessions are (re)introduced.
Groups and sessions aggregate jobs.
Sessions are automatically removed on client disconnection (client disconnects at shut down or user switch);
- there is a bug on the client GUI : deleting and downloading several rows is now disabled. This is due to a bug in the table sorter; we don’t correct it since Java 6 introduces native table sorters. It will be corrected when our package is ported to Java 6;
- our package is now Java 6 (even if we don’t use Java 6 specific features -see above) and 64 bits compatible.
- Sept 25th 2008 : XWHEP 3.1.0
- in the configuration file, the SLKeystore variable can now contain a relative path;
- resource owners can open http://localhost:4324 to configure their worker;
- cache management improved and slightly modified :
- in general, information stored in cache is not downloaded again from the server. There are three exceptions: works, tasks and hosts are always redownloaded from the server since this information is subject to frequent change;
- client keeps its cache from run to run. A new command (xwclean) is introduced to clean the client cache; this command is also available in the Comm menu of the GUI;
- the worker cleans its local disk on shut down. Hence, the worker does not keep cache from run to run.
- a bug solved on server side : memory consumption more stable.
The JAR file is now also provided. To update you have to
- copy the JAR file in the lib directory and restart your server
- copy the JAR file where launcher.url, in the worker config file, points to; on next reboot workers will automatically download it.
Next figures show server memory consumption. The first one shows a memory-starving behaviour : after only 1h30, the amount of allocated memory is as high as 78 MB. The second one shows a more stable consumption : after two hours, the amount is still the initial one, at only 31 MB.
- Sept 10th 2008 : XWHEP 3.0.0
- X509 certificate proxy usage to enable resource sharing with institutional grids;
- synchronization improved: each message now expects an answer from server;
- performance degradation solved on server side.
The two next figures show 1000 submissions received by server. We can see the performance degradation on the first one; the second figure shows that the degradation is now solved.
Total execution time is 2.5 times higher because messages now expect an answer from server : this increases synchronization.
- September 1st, 2008 : XWHEP 2.1.0
- the scheduler has been modified to improve performance. It is not a simple round-robin any longer : it now searches the full task set to try to find a task that could fit worker needs;
- the autotest script now submits group tasks too.
The following SQL command shows results more readably
select apps.name as app,label,
hex(works.accessrights),hex(apps.accessrights),
works.status,users.login as worker_login,
users.rights as worker_level
from
works,apps,tasks,hosts,users
where
works.appuid=apps.uid and
works.uid=tasks.uid and
tasks.hostuid=hosts.uid and
hosts.owneruid=users.uid
order by works.label;
We can see that the public worker (whose login is worker) has run public jobs only (whose labels are public…); private workers (whose logins are user…) have run jobs of their own identity only (whose labels end with their own login).
- August 29th, 2008 : 2.0.0
- bugs solved on cache management;
- bugs solved on users and application management, on server side;
- bugs solved on client GUI;
- bugs solved on server certificate management.
To install this version, you must reinstall the database.
The server is now certified by a self-signed SSL key which must be generated by createKeys. The next version will use an X.509 certificate certified by a CA.
Installation and deployment need the following actions (in that order : createKeys must be executed before install).
$> make removeDB
$> make installDB
$> make clean
$> make
$> make createKeys
$> make install
For a production deployment, keys must be safely stored, otherwise (if you lose or accidentally regenerate keys) a full re-deployment is necessary.
Electronic key usage has a cost in terms of communication.
Figure on the left shows the number of TCP packets needed without SSL. Figure on the right shows the number with SSL : it is twice as large.
A script to auto test the platform is now provided in the bin directory:
$> xtremweb.tests.pl
You must have the platform privileged rights to run the script (as provided by the default client config file).
The script does the following:
- insert a new public application
- insert two new user groups
- insert 6 new users : two users per group and two users with no group
- insert a private application per user
- insert 12 jobs : one private and one public per user
- launch one public and 6 private workers on local host
- jobs monitoring
At the end of the script, we can see that all jobs are COMPLETED.
We can verify this with the following SQL command (we can’t check this with the client because the client does not show worker identity for each job) :
select works.status,works.label,hosts.name,users.login
from users,works,tasks,hosts
where tasks.uid=works.uid
and tasks.hostuid=hosts.uid and
hosts.owneruid=users.uid
order by users.login;
Next figure shows auto test results.
We can see that the public worker (whose login is worker) has run public jobs only (whose labels are public…); private workers (whose logins are user…) have run jobs of their own identity only (whose labels end with their own login).
- July 30th, 2008: you can download the MSDev deployment solution
- July 24th 2008 : 1.2.0
A bug found on worker side, in standard input (stdin) management.
There is still a problem if the user application tests stdin availability: there is a great chance that the platform has not had time to set stdin correctly before the application tests it.
This is due to the Java language used to develop the platform.
User application developers should not test stdin but just read it. If data availability from stdin is only an option for the application, please use a text file instead.
Example:
The following works correctly if myApp reads from stdin, but does not test stdin
$> myApp < aFile
Otherwise, if data from stdin is optional, please modify your application so that it can retrieve its optional data from a text file, as in:
$> myApp -f aFile
Sorry for the inconvenience.
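On the application side, the stdin advice above translates into two patterns, sketched here for a hypothetical application written in Python. The function names are assumptions; only the `-f aFile` option and the "read, do not test" rule come from the text.

```python
import argparse
import sys


def read_required(stdin=sys.stdin):
    """Mandatory input: simply read stdin, without probing its availability first."""
    return stdin.read()


def read_optional(argv):
    """Optional input: take it from '-f FILE' rather than from stdin."""
    parser = argparse.ArgumentParser(prog="myApp")
    parser.add_argument("-f", dest="path", default=None, help="optional data file")
    args = parser.parse_args(argv)
    if args.path is None:
        return None  # no optional data was provided
    with open(args.path) as handle:
        return handle.read()
```

The first pattern blocks until the platform has connected stdin, which is exactly what avoids the race described above; the second never touches stdin at all.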
- July 21st 2008 : 1.1.0
- introducing X.509 proxy management
- a bug solved on client side, on data download
To install this version, you must reinstall the database
$> make removeDB
$> make installDB
$> make install
- July 17th 2008 : 1.0.31
- a bug found on worker side, on error management
- a bug found on input/output
- the worker HTML page is now customizable
- performance improved thanks to better cache management
Next figures show 1000 job submissions. We can see that the I/O correction and cache usage need four times fewer internal calls, for an execution 14 times faster.
- Jun 19th 2008 : 1.0.30
A major bug found on task management, on server side.
- May 22nd 2008 : 1.0.29
Three bugs corrected
- config output when inserting a new user;
- tasks downloads;
- data URI.
- May 7th, 2008 : 1.0.28
Minor clean up only.
- Apr 28th, 2008 : version 1.0.27 successfully tested on Grid5000 : 12000 jobs executed over 200 workers
- Apr, 25th 2008 : 1.0.27
Bug resolved in result storage protocol.
- Apr, 23rd 2008 : 1.0.26
Bug resolved in data submission on client side.
- Apr, 17th 2008 : 1.0.25
A database access bug resolved.
- Apr, 15th 2008 : 1.0.24
We don’t use log4j any more, since we suspect memory leaks.
- Apr, 8th 2008 : 1.0.23
A bug resolved on result upload on worker side.
- Apr, 4th 2008 : 1.0.22
A bug resolved on communication layer.
- Apr, 3rd 2008 : 1.0.21
A bug resolved on worker side.
- Apr, 2nd 2008 : 1.0.20
A bug resolved on client side; result download works correctly now.
- Apr, 1st 2008 : 1.0.19
The client automatically creates data when submitting jobs with stdin and/or environment.
- Mar 17th 2008 : a bug on communication delays resolved.
- Feb 20th 2008 : a bug on HTTP layer is under investigation; TCP layer is the default one until further notice.
- Feb 7th 2008 : two bugs solved :
- inter thread synchronization on server side;
- public/private IP resolution.
- Jan 25th 2008 : the scheduler has been debugged; it now correctly manages private, group and public applications and workers:
- public worker has "WORKER_USER" user rights; it can manage public jobs only (jobs access rights includes o+rx);
- group worker is a public worker (with "WORKER_USER" user rights) belonging to a group; it can manage group jobs only (jobs access rights includes g+rx);
- private worker is a non public worker (without "WORKER_USER" user rights) and can manage its own user jobs only (jobs access rights includes u+rx)
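The three worker kinds listed above can be summarised as a small eligibility check. This is a sketch under assumptions: access rights are read as one hexadecimal digit each for user, group and others (as in the 0x750 notation used elsewhere in these notes), "includes rx" is tested as a bit mask, and the function names are hypothetical. The actual scheduler may apply stricter rules.

```python
RX = 0x5  # read + execute, assuming Unix-like bits in each hex digit


def worker_kind(has_worker_user_rights, group):
    """Classify a worker as 'public', 'group' or 'private'."""
    if not has_worker_user_rights:
        return "private"
    return "group" if group is not None else "public"


def can_run(kind, job_access_rights, same_owner=False, same_group=False):
    """May a worker of this kind run a job with these access rights?"""
    user = (job_access_rights >> 8) & 0xF
    group = (job_access_rights >> 4) & 0xF
    other = job_access_rights & 0xF
    if kind == "public":
        return (other & RX) == RX                  # o+rx
    if kind == "group":
        return same_group and (group & RX) == RX   # g+rx, same group
    return same_owner and (user & RX) == RX        # u+rx, own jobs only
```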
- Jan 16th 2008 : a bug found on client level;
- Jan 11th 2008 : a bug found on DB management;
- Jan 8th 2008 : installers ready; client debugged;
- Dec 21st 2007 : The middleware meets the requirements. Deeper tests are in progress;
- Nov 26th 2007 : The middleware is in final testing.