Events:

2010-12-17 at 17:10 [xxx (Ekman)]
Parts of rack 21 on Ekman are without power. Jobs currently running on nodes k21*.pdc.kth.se may thus have failed. Sorry for the inconvenience.
2010-12-13 at 14:26 [xxx (Ekman)]
The system has been back in operation for a while now. The OSS server has been checked, is running well again, and new jobs have been started again.
2010-12-13 at 09:48 [xxx (Ekman)]
One Ekman Lustre OSS server is experiencing problems, and Lustre is partially unavailable. No new jobs will start during the investigation.
2010-12-10 at 07:32 [xxx (lindgren)]
Lindgren is now stopping and shutting down for upgrade.
2010-12-09 at 18:32 [xxx (lindgren)]
Lindgren went down, most likely due to human intervention. It's being restarted.
2010-12-07 at 12:13 [xxx (lindgren)]
We have now restarted Lindgren.
2010-12-07 at 10:59 [xxx (lindgren)]
Lustre is stuck. Investigation in progress.
2010-12-03 at 14:52 [xxx (lindgren)]
Lindgren now available again after heat-exchanger upgrade.
2010-11-29 at 06:54 [xxx (lindgren)]
As announced separately, we will now bring down and cover Lindgren during the replacement of overhead heat exchangers. This is expected to take on the order of days.
2010-11-26 at 15:35
The primary kerberos server has network connectivity again. Its operation should be back to normal.
2010-11-26 at 14:50
The primary Kerberos server for checking out tickets was unavailable from most of 130.237.0.0/16 from 2010-11-24 15:00 to 2010-11-26 15:30. The secondaries remained operational. Most clients/users with ordinary network access and configuration should communicate with any of the secondaries automatically, while those with restricted/incomplete network access, or special configuration, might experience problems checking out tickets. From the affected networks, password changes were not possible. Password changes were propagated to the slaves again after the outage.
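Automatic failover of this kind works when clients know about more than one KDC. A minimal sketch of a krb5.conf realm section with redundant KDCs (the realm name and hostnames below are hypothetical placeholders, not PDC's actual configuration):

```ini
[realms]
    EXAMPLE.KTH.SE = {
        # Clients try the listed KDCs in order (or discover them
        # via DNS SRV records), so ticket acquisition keeps
        # working if the first KDC is unreachable.
        kdc = kdc1.example.kth.se
        kdc = kdc2.example.kth.se
        # Password changes go through the admin server only,
        # which is why they fail while the primary is down.
        admin_server = kdc1.example.kth.se
    }
```

A client pinned to a single `kdc` entry, or behind a firewall that only permits traffic to the primary, loses ticket acquisition entirely in an outage like this one.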
2010-11-23 at 23:15 [xxx (Ferlin)]
The login node "ferlin.pdc.kth.se" was not accessible over the network and has been restarted. It is operating normally now.
2010-11-18 at 21:06 [xxx (lindgren)]
Lindgren has been restarted and is on-line again.
2010-11-18 at 16:07 [xxx (lindgren)]
Lindgren's Lustre file system is misbehaving. We are working on this problem and we will let you know when the system is back in business again.
2010-11-15 at 16:47 [xxx (Ekman)]
The login node of ekman is being restarted due to very low responsiveness.
2010-11-02 at 10:47
Due to a software error in one of our Juniper routers, most of PDC was not reachable 10:47-11:24 MET.
2010-10-27 at 18:33 [xxx (Ekman)]
Ekman is accessible and runs jobs again.
2010-10-27 at 01:42 [xxx (Ekman)]
Following several security incidents during the last weeks (one three weeks ago, three in the course of the last week), several updates of the Linux distribution have been released. For this reason we are going to reinstall all compute nodes, and node allocation to jobs has therefore been stopped. In the course of the day we will reinstall most of the nodes and return them to general use. The login node will not be available after lunch, between 13:30 and 14:30. Currently running jobs will continue to run.
2010-10-26 at 01:47 [xxx (Ekman)]
Access to the system is possible again. The number of available nodes is somewhat reduced at the moment.
2010-10-26 at 01:47 [xxx (Ferlin)]
Access to the system is possible again. The number of available nodes is somewhat reduced at the moment.
2010-10-25 at 16:20 [xxx (Ekman)]
During tonight we will put in place a system configuration that allows access to the system again as well as continuation of jobs. The exact time when access will be possible again is not yet determined. Thank you in advance for your patience.
2010-10-25 at 16:14 [xxx (Ferlin)]
During tonight we will put in place a system configuration that allows access to the system again as well as continuation of jobs. The exact time when access will be possible again is not yet determined. Thank you in advance for your patience.
2010-10-25 at 08:10 [xxx (Ekman)]
No fix for the security vulnerability CVE-2010-3856 is yet available for Red Hat/CentOS Linux distributions. For this reason, access to the cluster remains blocked. We will post here about the restoration of access in the course of today.
2010-10-25 at 08:10 [xxx (Ferlin)]
No fix for the security vulnerability CVE-2010-3856 is yet available for Red Hat/CentOS Linux distributions. For this reason, access to the cluster remains blocked. We will post here about the restoration of access in the course of today.
2010-10-24 at 16:51 [xxx (lindgren)]
Access to lindgren enabled again. Applications requiring 32-bit libraries will likely fail as 32-bit libraries are grounded.
2010-10-22 at 21:38
Due to security vulnerability CVE-2010-3856, access to most systems is disabled until further notice.
2010-10-20 at 10:34
Login re-instated on select systems.
2010-10-20 at 08:41
On select systems: login disabled while investigating the 2nd security alert this week. This time CVE-2010-3904.
2010-10-18 at 16:19 [xxx (Ekman)]
Due to a potential security problem, all users had to be logged out from the login node of the system without the possibility of advance notice. We are sorry for the inconvenience. Operation is now continuing normally.
2010-10-18 at 16:19 [xxx (Ferlin)]
Due to a potential security problem, all users had to be logged out from the login node of the system without the possibility of advance notice. We are sorry for the inconvenience. Operation is now continuing normally.
2010-10-18 at 16:11
Due to Linux security issue CVE-2010-3847, access to systems will have to be restricted and computers will have to be rebooted on short notice to make countermeasures effective.
2010-11-01 at 06:00
Due to work on the power infrastructure at PDC, all computational resources will be unavailable between 2010-11-01 06:00 and 2010-11-02 06:00 (or earlier if less time is needed).
2010-10-05 at 10:54 [xxx (lindgren)]
Lindgren is currently unavailable - investigations are in progress.
2010-09-21 at 13:15
The AFS volumes which were not accessible due to a server outage this morning are available again.
2010-09-21 at 10:40
AFS server houting has gone missing (again). This time with fewer volumes on it, but several unfortunately do remain. The migration of data off the server was postponed during the security audits.
2010-09-20 at 10:06 [xxx (lindgren)]
Informational/clarification: as sent to the user list, Lindgren got its security patches and went on-line again at 2010-09-18 23:00.
2010-09-17 at 17:19 [xxx (Ferlin)]
The necessary upgrades due to a security exploit have been completed. The login node is accessible to users again, and it is now possible to submit new jobs.
2010-09-16 at 10:42
Login to all systems is disabled due to a security audit.
2010-09-13 at 12:00
File server houting has now been salvaged and its services restored. We will gradually move all data off the server; this is a transparent operation, though it may slow the server down.
2010-09-13 at 09:04
The reported problem with the AFS server "houting.pdc.kth.se" means that not all volumes can be mounted on the cluster systems Ferlin and Ekman. Therefore, no jobs will start until the problem is solved.
2010-09-11 at 22:17
The AFS server houting (one of the 11 in use) has gone down. It contains a subset of home volumes (user home directories) as well as a subset of applications (e.g. some versions of Gaussian). Repairs will start Monday morning at the very latest.
2010-08-31 at 14:12 [xxx (Ekman)]
A reduced number of nodes will be available on Thursday, September 2nd, 2010 between 07:00 and 12:00. Background information: on that day we will perform work to extend PDC's power supply installation. During the work we have limited capacity for backup cooling, and in the case of unexpected events we must be able to shut down parts of the systems very quickly. Therefore the usage of Ekman will be limited to around 50% of the system size. As a side effect, you may observe that submitted jobs do not start immediately even if nodes are free; this happens when the runtime of your job would overlap with the reservation for Thursday. Best regards, Michael Schliephake
2010-07-29 at 18:28 [xxx (Hebb)]
Hebb should be available again now.
2010-07-29 at 11:10 [xxx (Hebb)]
Due to a blown fuse, Hebb will be taken down for service immediately.
2010-07-28 at 11:38 [xxx (Ekman)]
The login node ekman.pdc.kth.se of the Ekman cluster crashed/panicked, probably because of an error in the Lustre client kernel module. We will reboot and investigate.
2010-07-16 at 17:28 [xxx (Ekman)]
ekman.pdc.kth.se - the login node of the Ekman cluster - was rebooted due to out-of-memory issues. It is back now.
2010-07-06 at 15:17 [xxx (Ekman)]
The login node, ekman.pdc.kth.se, again has problems accessing the parallel filesystem and will be rebooted.
2010-07-05 at 22:04 [xxx (Ekman)]
The reboot of the login node seems to have resolved the deadlocks. Everything should now be back to normal and jobs are starting again.
2010-07-05 at 21:04 [xxx (Ekman)]
The problem with the parallel filesystem on the login node reappeared a short while after restarting the metadata server. The login node will now be rebooted to, hopefully, clear possible filesystem deadlocks on that node.
2010-07-05 at 19:26 [xxx (Ekman)]
The login node, ekman.pdc.kth.se, can't access the parallel filesystem anymore. The metadata server is being restarted to try to solve the problem. The queue has been stopped until the problem is fixed.
2010-06-25 at 10:33
Severe cooling disturbances. On Ekman especially, several nodes have auto-shutdown as a safety precaution. Jobs running on the affected nodes have failed.
2010-06-16 at 19:40
Major routing / network problems in SUNET. No connectivity from KTH to the Internet.
2010-05-19 at 18:54
Due to the combination of network maintenance and misconfiguration of routers/switches both at PDC and upstream (KTH), PDC users experienced extremely bad network performance during the afternoon. Currently, all network equipment is working again, but with unknown or reduced redundancy.
2010-05-17 at 11:32 [xxx (Ferlin)]
The login node had to be restarted. The data below /scratch has been removed.
2010-05-05 at 13:27 [xxx (Ellen)]
Key is currently unavailable. We are working on getting her back in business. Sorry for the inconvenience.
2010-04-26 at 20:08 [xxx (Ekman)]
No serious errors were found in the underlying filesystem and the queue has now been resumed again.
2010-04-26 at 18:57 [xxx (Ekman)]
One of the fileservers serving the cluster filesystem of Ekman complained about its underlying filesystem and is now running a check of that filesystem. Preliminary checks only show minor errors, possibly from when the filesystem got full. No new jobs will start until the problems have been corrected, which is expected to complete in ~1 hour. If no more problems are encountered, the queue will then be resumed.
2010-04-22 at 13:52 [xxx (Ferlin)]
login (ferlin) recovered. We will keep it under observation. Sorry for the inconvenience.
2010-04-22 at 13:06 [xxx (Ferlin)]
The login node (ferlin) is currently experiencing problems. Fault search in progress.
2010-04-21 at 08:00
Power work successfully ended. Reboots and related system maintenance are in progress. Systems will gradually return during the day (maintenance window ends at 17:00 CST).
2010-04-04 at 00:59
This is an after-the-fact flash news item for the AFS server problem we have had since 2010-04-04 00:59 CST. At that moment, the hardware of one AFS server died for as yet unknown reasons. It has not been possible to power on that particular server again. The I/O system of that server, including all drives, has been transplanted into other server hardware. After the usual file system repairs, the AFS server, including all data, is available again. We currently have no indications of data loss, but please notify us as soon as possible if you think something is amiss.

With this file server running again, we are currently restarting the batch systems and are again able to send you this message.

Sorry for the inconvenience, Harald for PDC staff.

2010-04-01 at 17:17
Strong grime buildup requires shutdown and cleaning of the espresso machine at PDC. During the downtime, staff response may be slower because of the resulting lower caffeine level, so please refrain from any extraordinary activities, especially on the login nodes. Normal operation of the espresso machine and the staff will resume after the Easter holidays.

For PDC-staff, Harald.

2010-03-23 at 15:36 [xxx (HSM)]
IBM had a spare part close by which they have now replaced. The HSM is back to normal again.
2010-03-23 at 11:18 [xxx (HSM)]
A power supply in one of our tape libraries has gone up in smoke. This means that it is currently not possible to recall files from tape in our HSM system. More information to come once the service technicians have been here.
2010-04-20 17:00
Due to service on PDC's power infrastructure, all compute systems will be unavailable between 2010-04-20 17:00 and 2010-04-21 17:00. No queued jobs should be interrupted, but they may be delayed until after the service window.
2010-03-04 at 17:53 [xxx (Ekman)]
Some operations in the AFS filesystem on Ekman have been left hanging for some time due to the combination of a bug in OpenAFS and a misbehaving server in another AFS cell. This problem has now been identified and resolved, and everything should work as usual again. If you still find anything out of the ordinary, please let us know.
2010-02-25 at 12:13
One AFS server with important data for running the EASY scheduler had an uncorrectable memory error. Repair took some time, as the (Dell) hardware could not tell which memory module had failed. The server has now been restarted with half of its memory, and the AFS data is being moved to a new server. We regret that replication of data could not prevent this outage. So far we do not see any data loss, but we are, as always, glad if you report anything suspicious. We will now check the batch systems and restart queues as necessary.
2010-02-25 at 11:33 [xxx (Ellen)]
No normal operation due to AFS server problem.
2010-02-25 at 11:33 [xxx (Ekman)]
No normal operation due to AFS server problem.
2010-02-25 at 11:33 [xxx (Ferlin)]
No normal operation due to AFS server problem.
2010-02-16 at 16:21 [xxx (Ferlin)]
Ferlin login node ferlin.pdc.kth.se: downtime of 15 minutes on Wednesday, Feb. 17, 2010, between 9 a.m. and 10 a.m.
2010-02-08 at 16:10 [xxx (Ekman)]
The filesystem is now back on-line again and the queue has been started again. There were no errors found when restarting the filesystem, but as always, please report anything out of the ordinary.
2010-02-08 at 13:03 [xxx (Ekman)]
The fileserver had a kernel crash for an unknown reason. It is now running filesystem checks to make sure no data was lost; this is expected to take a couple more hours.
2010-02-08 at 12:06 [xxx (Ekman)]
The queue on Ekman is currently stopped due to problems with one of the fileservers for the /cfs filesystem.
2010-01-29 at 10:31 [xxx (Ekman)]
The Ekman login node ekman.pdc.kth.se is currently not responding. We are looking into this but the problem might persist for an hour or more.
2010-01-14 at 21:30 [xxx (Ferlin)]
Due to a network configuration fault, parts of the clusters ferlin and ruth (swegrid) were unreachable from the Internet between approximately 17:00 and 21:30 MET. (We have not seen any impact on network communications inside PDC.)
2010-01-08 at 18:16
The network problems should now be fixed and all services back to normal. There is still broken hardware in at least one of our routers, but a workaround is in place. Further debugging will be postponed until next week and done under more controlled circumstances.
2010-01-08 at 14:16
We are experiencing network problems. Some users might have difficulties connecting. We are looking into the problem.