Events:

2012-12-31 at 09:46 [xxx (Ekman)]
The batch processing on Ekman has been restarted.
2012-12-30 at 03:42 [xxx (Ekman)]
A RAID set on one file server has been reported as degraded. I have stopped the batch system because the situation is unclear at the moment; no jobs are currently waiting. The situation will be investigated during Sunday.
2012-12-28 at 11:46
The license server is back. Please inform "support@pdc.kth.se" if you have problems using software licenses.
2012-12-28 at 10:04
The license server 'lic.pdc.kth.se' is currently down. Software that obtains its license from it will not be available until the problem is fixed.
2012-12-21 at 15:00
Povel has an allocation pause until 2012-12-22 in order to make some changes to node allocations. (Please note that no jobs would be expected to start before then even without the allocation pause.)
2012-12-10 at 01:25 [xxx (Ekman)]
The login server ekman.pdc.kth.se has been restarted due to problems that blocked logins to it earlier on Sunday.
2012-12-04 at 11:07 [xxx (lindgren)]
The lindgren login node has been restarted and is available again.
2012-12-04 at 10:46 [xxx (lindgren)]
Due to lingering problems from the blackout yesterday, the lindgren login node will be rebooted today at 11:00. These problems only affect the login node, so all running and queued jobs will be unaffected.
2012-11-28 at 11:41
The connection from Povel and Ellen to Klemming will be temporarily down due to an emergency replacement of faulty memory in the server acting as gateway between Povel/Ellen and Klemming. Processes on Povel/Ellen with files open in Klemming will most likely be affected.
2012-11-13 at 21:59 [xxx (lindgren)]
The Klemming problems (short dropouts) were possibly of a transient nature. Scheduling has resumed. Please report unexpected behavior.
2012-11-13 at 21:45 [xxx (lindgren)]
We seem to have a problem with the file system Klemming. Job starts are inhibited until it has been investigated. Due to the late hour, the investigation might take a while to start.
2012-11-12 at 18:06 [xxx (Ferlin)]
Some remaining problems with the queue have now been cleared up and jobs should now start normally. If you see jobs that you know have already finished but are still allocated to nodes, you can release them with 'sprelease'. Please report anything out of the ordinary.
2012-11-12 at 15:10 [xxx (lindgren)]
Part of the scheduler had crashed; it has now been restarted and jobs are starting again. Please report anything out of the ordinary.
2012-11-12 at 14:30 [xxx (lindgren)]
The scheduler is currently down. Investigation in progress.
2012-10-24 at 22:39 [xxx (lindgren)]
To free up possibly locked system resources, the login node will be rebooted shortly. Some users have recently had problems accessing /afs/.
2012-10-24 at 18:25 [xxx (Zorn)]
On Thursday (2012-10-25) we will have a course using the GPU cluster Zorn. Other users are kindly asked not to submit jobs between 10:00 and 16:00.
2012-10-24 at 18:23 [xxx (Zorn)]
On Thursday (2012-10-25) we will have a course using the GPU cluster Zorn. Other users are kindly asked not to submit jobs during that time.
2012-10-20 at 10:56
Services at PDC and compute node availability in the clusters Ferlin, Ekman and Povel could be affected by maintenance work on the CSC school infrastructure in the course of the day. Automatic recovery is expected once this work ends. http://www.kth.se/en/csc/it-support-csc/news/serverunderhall-lordag-20-okt-1.342464
2012-10-18 at 13:55 [xxx (Ekman)]
Cluster Ekman: The restart of the file server has been done and it appears that at least most applications have continued running. To be sure that everything works fine, please check your running jobs.
2012-10-18 at 11:10 [xxx (Ekman)]
The restart of the fileserver will happen today, 2012-10-18, at 13:00.
2012-10-18 at 10:44 [xxx (Ekman)]
The announced restart of the fileserver has to be postponed. We will send updated information.
2012-10-17 at 11:33 [xxx (Ekman)]
A short operation stop is needed in order to restart one of Ekman's file servers. Running jobs can probably continue, but interruptions are possible. The operation stop will happen tomorrow, 2012-10-18, at about 10:30.
2012-10-16 at 20:19
The Klemming file system is now back and the queue on Lindgren has resumed. The underlying reason for this hiccup in one small part of the system was not really found, but it was probably related to the server crash earlier today. Please report if you see any problems again.
2012-10-16 at 17:25
We are again having problems with the file system Klemming. This is a different problem from the one earlier today, but it might be related. Investigation in progress.
2012-10-16 at 13:28
The file system Klemming should now be fully operational again. The reason for the crash is not yet fully understood. Please report anything out of the ordinary.
2012-10-16 at 12:15
One of the servers for the file system Klemming is down. Investigation in progress.
2012-10-08 at 21:12 [xxx (lindgren)]
After a restart of one of the servers of /cfs/klemming earlier this evening, the file-system is back in ordinary operation.
2012-10-08 at 10:30 [xxx (lindgren)]
We are having some issues with /cfs/klemming on Lindgren. We are investigating and will get back to you as soon as we have more information.
2012-10-08 at 10:28 [xxx (lindgren)]
We are seeing intermittent problems with access to the file-system klemming from Lindgren. Investigations are in progress.
2012-10-03 at 11:28
The file system Klemming is now fully back again and all batch queues have been started. Please report anything out of the ordinary.
2012-10-03 at 09:18
The file system Klemming is currently having some problems. Investigation in progress.
2012-10-01 at 08:56 [xxx (Ekman)]
Cluster Ekman: Due to a mistake when issuing a command, I accidentally terminated more jobs than intended. Please resubmit. Sorry for the inconvenience.
2012-09-28 at 14:24 [xxx (Ellen)]
Ellen is back in production. The maintenance went well and faster than planned.
2012-09-26 at 14:58 [xxx (Ellen)]
Ellen will be unavailable on Friday afternoon due to maintenance work. The maintenance is planned for 2012-09-28, 13:00 - 16:00.
2012-09-20 at 20:27
The filesystem /cfs/klemming is not accessible from either the cluster Povel or the system Ellen.
2012-09-20 at 17:34 [xxx (lindgren)]
Maintenance finished; lindgren has been running jobs for a while and is now also available for login again.
2012-09-18 at 14:11
PDC's primary ftp server (ftp.pdc.kth.se) has been restarted to resolve problems with its access to AFS.
2012-09-14 at 13:29 [xxx (lindgren)]
This coming Thursday, 2012-09-20, lindgren will be unavailable for hardware maintenance starting at noon, 12:00. The system is expected to be back during the late afternoon or evening of that day.
2012-09-06 at 13:07 [xxx (Ekman)]
During the next few days, Ekman will be operated with a slightly reduced number of available nodes. You can expect 1150 to 1160 available nodes. Please keep this in mind when submitting jobs according to your share. The nodes have been taken out of operation to allow the moving and refurbishing of compute nodes that became defective in the months since the end of the warranty service.
2012-08-24 at 20:50 [xxx (lindgren)]
Batch jobs resumed 20 minutes ago. No reason has yet been found for the stop of /cfs/klemming/.
2012-08-24 at 15:30 [xxx (lindgren)]
The Klemming file system is now back on-line again after a restart of the disk system. The reason for the failure is still unknown, but it is probably related to the crash on Monday. The batch system on Lindgren remains halted until more checks have been done.
2012-08-24 at 13:51 [xxx (lindgren)]
Lindgren's filesystem has problems. Investigation is ongoing.
2012-08-21 at 11:45 [xxx (lindgren)]
/cfs/klemming is now on-line again. Batch scheduling has resumed. Note that running jobs were stuck during the outage and many of them have likely experienced problems.
2012-08-21 at 07:17 [xxx (lindgren)]
The file-system /cfs/klemming had a failure a few hours ago, and is currently not operational, i.e., all file-accesses freeze.
2012-08-17 at 11:44 [xxx (Ekman)]
Batch job processing has resumed.
2012-08-17 at 09:05 [xxx (Ekman)]
The batch system has been stopped because the meteorology groups are running on more than 900 nodes, which prevents jobs of the mechanics group from starting. The situation will hopefully stabilize again during the next hours.
2012-08-14 at 10:15 [xxx (Zorn)]
Cluster Zorn: The system is back in production.
2012-08-13 at 17:23 [xxx (Ekman)]
Cluster Ekman: No new batch jobs are starting at the moment due to network problems at KTH and PDC. Running jobs should not be affected. Evaluation and work to solve the issue are under way.
2012-08-10 at 10:47 [xxx (lindgren)]
We had problems with the main lindgren login node (lindgren.pdc.kth.se) this morning, causing it to reject new ssh connections. It should be available again now.
2012-08-10 at 09:23 [xxx (Zorn)]
Cluster Zorn: The Lustre file system has become more unstable over the last few days. Further diagnosis will be carried out during the day. For that reason, access to the system is not possible at the moment.
2012-08-05 at 18:15 [xxx (Zorn)]
Cluster Zorn: Operation has been stopped due to new problems with the Lustre filesystem since today 16:30.
2012-07-29 at 20:14 [xxx (Ekman)]
The cluster Ekman is back in production.
2012-07-29 at 11:34 [xxx (Ekman)]
The Lustre file system of Ekman has a problem that may crash running jobs. The batch system has also been stopped to keep new jobs from running into problems. The cause is a crash of one of the file servers. It is unclear at the moment how severe the incident is and when we can go back into operation.
2012-07-25 at 11:12 [xxx (Ellen)]
The leap second problem on ellen should now be fixed without a reboot and MATLAB seems to be working again. Please watch out for hanging MATLAB processes and try to kill and restart those.
2012-07-25 at 10:15 [xxx (Ellen)]
We are currently seeing hanging MATLAB processes on ellen.pdc.kth.se. This may be related to a java bug causing the JVM used by Matlab to spin. Please refrain from using matlab without the -nojvm option. We may also have to reboot the machine to rectify the problem, which we will announce separately.
2012-07-21 at 10:00
Saturday 2012-07-21: Maintenance of AFS servers in nada.kth.se. If your $HOME is in /afs/nada.kth.se/.... you are probably affected and should be aware of this. If your data is in /afs/pdc.kth.se/... you are not affected.
2012-06-20 at 09:49 [xxx (Zorn)]
The Lustre filesystem is down again. Batch operation is not possible at the moment.
2012-06-19 at 18:25 [xxx (Zorn)]
Zorn filesystem has been repaired and improved. System is back online.
2012-06-19 at 17:30 [xxx (lindgren)]
software maintenance: the system software upgrade is finished. We will gradually resume batch. Welcome back, and please report deviations from expected behaviour.
2012-06-19 at 16:30 [xxx (Zorn)]
Zorn's parallel file system is currently not available. Investigations are under way.
2012-06-13 at 14:41 [xxx (lindgren)]
software maintenance: lindgren will be unavailable starting 2012-06-19 09:00 for system software upgrades. We aim to be finished within 24..48 hours.
2012-06-12 at 12:47 [xxx (Ekman)]
The login server "ekman.pdc.kth.se" is online again.
2012-06-12 at 12:36 [xxx (Ekman)]
The login server "ekman.pdc.kth.se" became unstable, likely due to some failures in file transfers. This even led to some unstable system services. Therefore the login node will be restarted.
2012-06-06 at 00:50 [xxx (lindgren)]
The system is back and has been running jobs for roughly half an hour. The actual cause of the module shutdown has not yet been determined.
2012-06-05 at 21:40 [xxx (lindgren)]
As several modules have shut themselves off, the interconnect is currently severely degraded. We are trying to figure out whether the shut-offs are due to 'real physical' problems or to false positives. For the time being, the system is to be considered off-line.
2012-05-10 at 16:37 [xxx (Ferlin)]
The batch processing on Ferlin has resumed. Two racks must remain switched off at the moment for more repair work and further maintenance.
2012-05-10 at 12:21 [xxx (Ferlin)]
It is not possible to start new jobs at the moment due to network problems.
2012-05-07 at 17:37 [xxx (Hebb)]
The GPFS filesystem, /gpfs/scratch, is currently unavailable and no jobs will start. The reason is probably too many disk failures, but investigation is ongoing.
2012-04-30 at 14:00 [xxx (lindgren)]
The system has now been restarted and jobs waiting in line are being started.
2012-04-30 at 11:59 [xxx (lindgren)]
While rebooting a few compute nodes we got a stuck HSN (interconnect). We will restart the system within the next hour.
2012-04-24 at 21:23 [xxx (Ekman)]
The cluster Ekman will not be available on Thursday, May 3rd 2012, from 11:00 until 17:00 due to necessary upgrades of the system software.
2012-04-24 at 21:23 [xxx (Ferlin)]
The cluster Ferlin will not be available on Thursday, May 3rd 2012, from 08:00 until 13:00 due to necessary upgrades of the system software.
2012-04-21 at 09:00
If you have your AFS home directory at nada.kth.se (*), you might be affected by the following planned maintenance at CSC/NADA:

On Saturday, April 21, starting at 9am, maintenance work will be performed on some CSC servers.

Most computers at CSC will be affected during this time. Services like email and www will also be affected.

AFS at PDC is _not_ affected.

Your system groups at CSC and PDC.

(*) fs where $HOME | awk '$6 ~ "nada"'
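
The footnote's check relies on the AFS `fs` client. As a rough alternative when `fs` is not at hand, the cell can be read off the path of $HOME itself. This is a minimal sketch, assuming home directories follow the /afs/<cell>/... layout mentioned in these announcements; the helper name `afs_cell_of` is hypothetical and not a PDC or OpenAFS tool.

```shell
# Hypothetical helper (not part of AFS): print the cell name embedded
# in a path of the form /afs/<cell>/..., or nothing for other paths.
afs_cell_of() {
    case "$1" in
        /afs/*/*)
            rest=${1#/afs/}              # strip the leading /afs/
            printf '%s\n' "${rest%%/*}"  # keep only the first component
            ;;
    esac
}

# Report whether the current home directory is in the nada.kth.se cell.
# Assumption: $HOME is the real AFS path and not a symlink to it.
if [ "$(afs_cell_of "$HOME")" = "nada.kth.se" ]; then
    echo "HOME is in the nada.kth.se cell: probably affected"
else
    echo "HOME is not in the nada.kth.se cell"
fi
```

This only looks at the path string, so it says nothing about which file server actually holds the volume; the `fs where` command above remains the authoritative check.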

2012-03-19 at 15:21 [xxx (Ekman)]
Disc cleaning of /cfs/ekman/scratch will be started on Tuesday, March 20th, 2012.
2012-03-19 at 11:39 [xxx (SBC / CBR)]
One AFS server, shad.pdc.kth.se, serving volumes for SBC has suffered a major disk/RAID failure. Further investigation is ongoing to figure out how serious the problem is. Update (12:55): RAID failure _was_ serious, restore from backup has started.
2012-03-14 at 14:40 [xxx (Ekman)]
Disc cleaning of /cfs/ekman/scratch will be started on Thursday, March 15th, 2012.
2012-03-12 at 22:53 [xxx (lindgren)]
The log-in node was unreachable and has been restarted. This has no effect on jobs running in the batch system.
2012-02-29 at 23:10 [xxx (Ferlin)]
A file-system hiccup caused a stale lock for scheduling on Ferlin; the lock has been removed and operation has returned to normal.
2012-02-29 at 17:03 [xxx (lindgren)]
The lindgren login node has been restarted. At first glance it looks like a Lustre file system lockup, but this is far from certain.
2012-02-29 at 16:10 [xxx (lindgren)]
Login on Lindgren is currently not possible.
2012-02-29 at 16:11 [xxx (Ferlin)]
The login node "ferlin.pdc.kth.se" has been restarted.
2012-02-28 at 11:37 [xxx (lindgren)]
The login node has been restarted (no jobs should have been affected). Please report any unexpected side effects.
2012-02-28 at 11:18 [xxx (lindgren)]
The login node is currently having problems; if the investigation takes too long we will reboot it. Jobs are not affected.
2012-02-24 at 12:14 [xxx (lindgren)]
A number of jobs produced large amounts of output and filled the spool of the pbs_mom running jobs on lindgren. Several jobs were affected; at least some will have mangled stdout/stderr output files.
2012-02-23 at 16:00 [xxx (Ellen)]
HP service has repaired system Ellen; however, we have decided to reinstall the machine with a newer operating system. This will enable some of the missing features like local HD. Please have some patience while we get all the pieces into place. This is the only DL580G7 we have and there might still be some surprises left.
2012-02-20 at 00:00 [xxx (Ellen)]
High memory system ELLEN (p05c01n01) has been shut down because of HW problems. An alternate node to access your files (but NOT for computation) is ellen-tmp.pdc.kth.se. Maintenance is planned for 2012-02-21, outcome yet unknown. The start of maintenance was delayed because HP service dispatch could not call the correct telephone number mentioned in the email. HP tried to repair the node for 1 1/2 hours without success. More repair efforts with more new parts will continue tomorrow (2012-02-22). 2012-02-22 morning: HP still has difficulties contacting me on my telephone number. The HP technician recommends changing the system board, but HP service dispatch does not think they need to dispatch a technician before tomorrow (2012-02-23) at the earliest because of the "contract".
2012-02-17 at 19:40
The malfunctioning server has been working fine since about 18:00 and the clusters Ferlin and Ekman are buzzing again. Have a nice weekend!
2012-02-17 at 16:29
At the moment we have some problems with one of the Kerberos ticket servers. Compute nodes are currently experiencing authentication timeouts, which makes it impossible to start batch jobs on several systems, at least Ekman, Ferlin and Povel.
2012-02-14 at 10:43 [xxx (lindgren)]
Early this morning one of the servers serving jobs requiring dynamic shared libraries (aka DSL) went down. Jobs relying on this server might have been affected. It is being restarted.
2012-02-06 at 11:14 [xxx (Ekman)]
The Lustre file server that caused the interruption over the weekend has been checked and put back into operation. The batch system has been started again.
2012-02-05 at 00:25 [xxx (Ekman)]
The batch system has been stopped due to problems with the Lustre file system on Ekman. Servers of the file system reported errors in the course of Saturday, and many servers could not use the file system after that. The investigation and repair of the problem will continue on Monday.
2012-01-26 at 12:53 [xxx (lindgren)]
The system is available again. We will allow batch jobs to start within a few hours.
2012-01-25 at 12:17
The file system Klemming is now back on-line and has been up through the night without problems after the extended upgrade. As always, please report if you notice anything out of the ordinary.
2012-01-24 at 17:31
Due to previous hardware errors causing unforeseen problems with the update to the file servers for Klemming (/cfs/klemming), the file system will not come on-line during the afternoon today, as hoped. It will probably be available again some time before lunch tomorrow.
2012-01-24 at 08:15 [xxx (lindgren)]
As announced earlier, the system has now been taken off-line for extended maintenance.
2012-01-23 at 14:30
As the system serving the file system Klemming (/cfs/klemming) will be updated during the scheduled service window of Lindgren tomorrow, the file system will be unavailable during that process. Note that this affects all systems that mount /cfs/klemming, not only Lindgren. The file system is expected to be back sometime tomorrow afternoon.
2012-01-18 at 11:25 [xxx (Ferlin)]
The login node "ferlin.pdc.kth.se" had to be restarted due to an overload in the filesystem accesses that put the system in an inconsistent state.
2012-01-17 at 20:20 [xxx (Ekman)]
The system update is progressing. Batch system operation will resume at about 22:00.
2012-01-13 at 10:17 [xxx (Ekman)]
Ekman is working normally again. The stop was an effect of the rebuild following the file system failure two days earlier. No additional hardware problems occurred.
2012-01-12 at 21:11 [xxx (Ekman)]
There have again been problems with access to the Lustre file system since 18:20. Batch job starts have been stopped until further investigation and repair are possible. Running jobs are still displayed as running but have very likely been destroyed and left in an inconsistent state.
2012-01-12 at 13:50
[swegrid] Ruth is being upgraded and is not accepting fairshare jobs at the moment. It is being reconfigured.
2012-01-11 at 15:43 [xxx (lindgren)]
extended maintenance: starting Tuesday 2012-01-24/08:00, and through Thursday 2012-01-26, Lindgren will be off-line for maintenance.
2012-01-11 at 12:56 [xxx (Ekman)]
failed server restarted, /cfs/ekman on-line, and jobs are allowed to start again.
2012-01-11 at 09:42 [xxx (Ekman)]
/cfs/ekman has one storage server down. Accesses to the file system will block.