Events:

2013-11-29 at 09:35 [xxx (lindgren)]
The login node will be rebooted shortly to free up resources.
2013-11-25 at 15:04 [xxx (lindgren)]
Lindgren login has been re-enabled.
2013-11-25 at 12:28 [klemming]
Today at 12:45 we will do some fail-over operations affecting the Klemming file system. This means there may be some temporary hiccups, up to a couple of minutes, when accessing the file system. This should not seriously affect running jobs.
2013-11-23 at 18:24 [xxx (lindgren)]
Lindgren login is blocked until a possible security issue has been assessed. Further updates will be sent through the lindgren-users mailing list.
2013-11-06 at 15:32 [xxx (Zorn)]
The earlier reported power outage caused damage to the power distribution plant. A repair should be carried out tomorrow morning, so we hope to finalise the system upgrade tomorrow afternoon and be back in operation later on.
2013-11-06 at 15:28 [xxx (Ellen)]
The earlier reported power outage caused damage to the power distribution plant. A repair should be carried out tomorrow morning, and the system will hopefully be back in operation after lunch.
2013-11-06 at 15:24 [xxx (lindgren)]
The earlier reported power outage partly affected Lindgren and made it necessary to reboot some servers. Nevertheless, running jobs are not expected to have suffered from this.
2013-11-06 at 12:01
We have suffered a small power outage which might also have affected some user resources. Investigation is in progress.
2013-11-05 at 11:30 [xxx (lindgren)]
The Lindgren batch front-end has been overloaded. Most batch jobs have been, or will be, affected. Please do not run jobs on the batch front-end nodes; use the compute nodes.
2013-11-05 at 06:08 [xxx (Zorn)]
The maintenance of Zorn will be extended by one day. A new instability in access to the Lustre file system has been observed and needs to be investigated.
2013-10-29 at 11:03 [xxx (Zorn)]
The cluster Zorn will not be available for users due to maintenance work (OS and software upgrades) from 2013-11-03 08:00 CET until 2013-11-04 17:00 CET.
2013-10-24 at 16:51 [xxx (Zorn)]
The upgrade of the Lustre file system servers (/cfs/zorn) has been finished. Zorn is available again.
2013-10-24 at 10:20 [klemming]
Klemming had a hiccup around 10 AM and went into a fault-recovery mode during which all I/O operations hung. Everything seems to have finished recovering now and should be working again. The reason for the hiccup is still not clear, and we are looking into it. Please report anything still not working as expected.
2013-10-23 at 15:25 [xxx (Zorn)]
The maintenance of Zorn, scheduled for today, will be extended until tomorrow, Thursday October 24th, 2013, because the file system upgrade needs more time than expected.
2013-10-16 at 09:46 [xxx (lindgren)]
Last evening, 2013-10-15, between 22:00 and roughly 22:30, a few jobs were overusing resources, most notably memory, on a batch front-end, which caused problems for other jobs using the same batch front-end. Please take care to run on the compute nodes (aprun) rather than on a batch front-end.
2013-10-03 at 20:52
General impact. Due to wrinkles in the PDC DNS (name services), certain name service lookups failed for a couple of hours. This has now been fixed. The side effects have been failures in e.g. qsub, possibly logins, and most applications requiring name service lookups.
2013-10-02 at 11:35 [xxx (Zorn)]
System maintenance on Wednesday, 2013-10-23, from 09:00 until 17:00. Purpose of the maintenance is mainly an upgrade of the Lustre file system to a newer software version.
2013-09-30 at 09:05 [xxx (lindgren)]
The login node will be restarted around 10:00 today to free locked-up system resources.
2013-09-18 at 09:18
Login to the cluster Povel (login node povel.pdc.kth.se) is not possible at the moment due to hardware problems.
2013-09-13 at 13:06 [xxx (Ferlin)]
Due to very sluggish behaviour, the Ferlin login node is being restarted.
2013-09-09 at 15:23
The Povel login node had some problems and has now been rebooted. The reason for the problems is still unknown. Please report anything out of the ordinary.
2013-09-06 at 23:25 [xxx (lindgren)]
Due to a security warning regarding torque (a part of the batch system on Lindgren), torque has been patched. Jobs starting from now on execute using the changed piece of software. In theory you should not notice any changes; in practice, please report unexpected behaviour.
2013-09-06 at 21:03 [xxx (Zorn)]
Zorn is available again. The security update could be applied easily and without further complications, so normal operation has now resumed.
2013-09-06 at 20:28 [xxx (Zorn)]
Access to Zorn has had to be stopped immediately due to a security problem that was published an hour ago and also affects this cluster. The affected software package on Zorn will be updated tomorrow during the maintenance that was planned and announced earlier today.
2013-09-06 at 15:45 [xxx (Zorn)]
Access to Zorn will be restricted during the weekend of September 6th and 7th. Maintenance work will be carried out during these two days, which may mean that login to the cluster is temporarily disabled.
2013-09-05 at 17:39 [xxx (Zorn)]
The login node of Zorn has crashed due to a user program that was executed on the login node to test job configurations. Please use the node "g01n06.pdc.kth.se" as login node until tomorrow morning, when we can take care of the hardware again.
2013-09-03 at 22:00
Because of the VIP visit, the PDC offices and computer hall must be evacuated between 2013-09-03 22:00 and 2013-09-04 18:00 (times MEST). This means there will be no staff available (including personal and phone support) during that time. The computers are hopefully not affected by all the buzz outside. With no staff on site, repairs are not possible. We regret any inconvenience this may cause.
2013-08-19 at 11:01 [xxx (Zorn)]
The cluster Zorn is reserved for PDC's Summer School from 14:30 to 16:30 on August 19, 2013, and from 16:00 on August 22, 2013 until 17:30 on August 23, 2013.
2013-08-16 at 20:13 [xxx (Ferlin)]
The login node has to be restarted because logins were no longer possible; it was probably overloaded by a user.
2013-08-12 at 16:05
One of our AFS servers crashed recently but has now finished running a check and is on-line again. This unfortunately caused some problems accessing e.g. home directories. Everything should now be back to normal.
2013-08-02 at 07:48
The update of the OpenAFS and Kerberos installations at PDC and the CSC school has progressed well this week. The last step of this procedure will be carried out on Monday (Aug 5th). As with the measures earlier this week, you should not notice it in your work on PDC systems. In the unexpected case that you notice failures in access to your data, please try a sequence of "kdestroy" and "kinit" commands, repeating it after a few minutes if needed. We will observe the operation of Kerberos and OpenAFS for a while after this last step, and plan to remove the access blocks in PDC's and KTH's ports to the external networks thereafter. We therefore expect PDC's OpenAFS to be accessible worldwide again on Tuesday.
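The "kdestroy"/"kinit" sequence mentioned above can be sketched as follows. Note that the principal and realm shown are placeholders, not confirmed PDC values; substitute the username and Kerberos realm from your own PDC account instructions.

```shell
# Discard any stale Kerberos tickets that may be causing the failures
kdestroy

# Obtain a fresh ticket; "username" and the realm are placeholders --
# use your PDC account name and the realm PDC has given you
kinit username@NADA.KTH.SE

# Verify that a valid, unexpired ticket is now in place
klist
```

If file access still fails after this, waiting a few minutes and repeating the sequence (as the notice suggests) gives the servers time to settle after the upgrade step.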
2013-07-30 at 15:38 [xxx (lindgren)]
The Lindgren login node is currently inaccessible. We are working on the problem and it will hopefully be fixed soon.
2013-07-30 at 09:24 [xxx (Ellen)]
Users with NADA home directories are likely to have problems logging into Povel and Ellen. This is related to a security issue and will hopefully be fixed today.
2013-07-29 at 15:19
Direct access to PDC's OpenAFS file system is not possible from outside of PDC's network until further notice, due to the earlier reported security problems in OpenAFS. You can access your data if you log in to PDC or if you use scp via PDC servers. Some hints on how to use scp can be found on this PDC support page: http://www.pdc.kth.se/resources/software/file-transfer/file-transfer-with-ssh
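As a rough illustration of the scp workaround, copying a file from an AFS home directory to your local machine via a PDC server might look like the following. The hostname, path, and username here are hypothetical examples only; the support page linked above has the authoritative instructions.

```shell
# Pull a file from your AFS home directory down to the local machine,
# going via a PDC login/transfer server (hostname and path are
# illustrative placeholders, not confirmed PDC values)
scp username@server.pdc.kth.se:/afs/pdc.kth.se/home/u/username/results.dat .

# Push a local file up into AFS the same way
scp results.dat username@server.pdc.kth.se:/afs/pdc.kth.se/home/u/username/
```

Since the transfer is mediated by an ssh login to a PDC host, it works even while direct OpenAFS access from outside the PDC network is blocked.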
2013-07-25 at 09:53
We will carry out maintenance work and upgrade the OpenAFS installation at PDC during the day, due to severe security problems that were published last night. You may experience some service interruptions during that time; however, they should be comparatively short.
2013-07-25 at 08:00
Access to PDC's OpenAFS file system is not possible from outside until further notice. Severe security problems were published yesterday evening, making it necessary to protect the systems and the data on them with this measure. You can access your data if you log in to PDC or if you use scp via PDC servers.
2013-07-22 at 09:41 [xxx (Ferlin)]
The processing of the stopped batch jobs (which we reported on yesterday) has begun. The login node "ferlin.pdc.kth.se" has been pointed to other hardware. It can take a few hours (up to one day) until all DNS servers in the world, wherever you access PDC from, have been updated. Please use the name "ferlin3.pdc.kth.se" if you get an error message when trying to log in to "ferlin.pdc.kth.se".
2013-07-21 at 07:24 [xxx (Ferlin)]
The login node of Ferlin became inaccessible last night. Please use the login node "ferlin3.pdc.kth.se" until further notice. Jobs that had been submitted but were not yet running as of yesterday cannot start at the moment; their processing will probably continue tomorrow, after a repair of the defective server. However, you can submit new jobs now, and they will start immediately.
2013-07-16 at 15:34
Different networks within KTH seem somewhat isolated from each other. Users with NADA home directories might not be able to access their home directories from PDC systems.
2013-07-04 at 00:28 [xxx (Zorn)]
2013-07-06 09:00-12:00: break in regular operation for maintenance.
2013-06-27 at 09:44 [klemming]
Klemming has been on-line for roughly an hour, and Lindgren has been up and running for roughly half an hour.
2013-06-26 at 21:41 [klemming]
The Klemming upgrade is still in progress, albeit taking longer than expected. Once it is finished, other systems (e.g. Lindgren) can be powered up again.
2013-06-19 at 16:27 [klemming]
The file system Klemming (/cfs/klemming/) will get a software upgrade, planned to start on 2013-06-26 at 10:00. As it will be off-line during the upgrade, all systems using Klemming (lindgren, povel, ellen, ..) will be affected.
2013-06-04 at 11:35 [xxx (lindgren)]
The login node will be restarted around noon, as access to /cfs/klemming/ has become sluggish. Running or queued jobs will not be affected.
2013-05-16 at 13:12 [xxx (lindgren)]
To free locked-up resources, the login node will shortly be rebooted. Running and queued jobs will not be affected.
2013-05-15 at 19:26 [xxx (Ferlin)]
The Ferlin login node got stuck and is being restarted.
2013-05-05 at 17:14 [xxx (Ekman)]
Ekman has landed and the batch system has been shut down. With that we take our leave, and wish everyone a good continuation elsewhere.
2013-03-28 at 11:58
A notification of heightened risk. On Tuesday, the 2nd of April, at 10 AM we will replace a broken disk controller for the file system Klemming. While we don't expect this to be noticeable to users or running jobs, thanks to redundancy in the setup, such procedures always mean a heightened risk of hiccups.
2013-03-27 at 22:19 [xxx (lindgren)]
Due to what was likely a series of batch jobs running amok, all available default space for job stdout/stderr output suddenly filled up. As this has never happened before over the past 2+ years, we are somewhat uncertain of the damage done to other jobs. Your job logs have likely been garbled, and jobs may have paused in their execution. We have cleaned up the space and will soon enable job starts again. Please report any observations of unusual job behaviour.
2013-03-12 at 23:24 [xxx (lindgren)]
The physical move of the new Klemming is completed; Lindgren has been started up and is running jobs.
2013-03-11 at 17:15
The physical decommissioning of the old Klemming hardware and the subsequent move of the new Klemming is taking more time than expected and unfortunately won't be finished today. We expect to have it back on-line on all systems sometime tomorrow afternoon.
2013-03-11 at 09:52 [xxx (lindgren)]
As earlier announced, Lindgren will soon go off-line during the Klemming file system upgrade. This will take most of today.
2013-03-08 at 22:25
The final part of the upgrade of the file system Klemming will be done on Monday, the 11th of March, starting at 10AM. The downtime is expected to last most of the day and there will be no access to Klemming from any system during this time, including Lindgren, Ellen, Povel and cfs-aux-4. For Lindgren this means the whole system will be down during the outage. Additional work will also be done to the power distribution while the system is down.
2013-03-08 at 16:07 [xxx (lindgren)]
Lindgren available and running jobs again.
2013-03-08 at 14:30 [xxx (lindgren)]
While trying to bring the off-line cabinet back into the system, there were a few longer-than-expected timeouts, which caused several running jobs to fail. Our apologies. As there will be a full system stop during the final steps of the Klemming move anyway, we will postpone further attempts and run with slightly reduced capacity until then.
2013-03-07 at 21:43 [xxx (lindgren)]
We have resumed job starts but kept 1 cabinet in a powered off state. This will have slight impact on the performance of the high speed interconnect, and thus the performance for certain applications.
2013-03-07 at 12:59 [xxx (lindgren)]
Electrical work caused the cooling to fail, and 3/16 of the system executed an emergency power-off. Most running jobs have likely been affected. We will assess the electricity/cooling situation and will likely restart the system later.
2013-03-06 at 11:04 [xxx (Ferlin)]
The scheduler now runs on another piece of hardware. No running or queued jobs should have been affected. The backlog of pending requests (job submissions, job releases, ..) is quite large but is gradually being processed.
2013-03-06 at 10:20 [xxx (Ferlin)]
The machine scheduling jobs for Ferlin has a hardware failure. No job changes or submissions can be processed at the moment, but they will queue up and be processed once we have a replacement on-line.
2013-03-05 at 17:15 [xxx (Zorn)]
There will be a larger maintenance of the cluster Zorn, involving some mechanical work, from March 13th until March 14th. User access will not be possible during these two days.
2013-03-04 at 18:45 [xxx (lindgren)]
Forwarding info for users with CSC homes (/afs/nada.kth.se). System maintenance tonight: CSC's file servers go offline at 5:45 pm on Tuesday, March 4. Urgent system maintenance requires us to temporarily shut down CSC's AFS servers. This will affect many other CSC systems; in particular, the home directories of all CSC users will not be available. Normal service should be restored later this evening. (CSC Systems Group)
2013-03-04 at 17:43 [xxx (Ellen)]
The system is experiencing problems at the moment and cannot run stably for longer periods. User access has been suspended until the problem can be investigated in more detail.
2013-02-27 at 12:40 [xxx (lindgren)]
Lindgren is on-line again. Information on the progress of the migration of data from the old /cfs/klemming/ to the current /cfs/klemming/ will be sent out separately.
2013-02-27 at 10:03 [xxx (lindgren)]
This is a reminder that the Klemming file system will be upgraded today starting at 10AM, as previously announced. This means that all systems where Klemming (/cfs/klemming) is available will have downtime starting at 10AM. This includes Lindgren, Povel, Ellen, cfs-aux-4, griffin and koaro.
2013-02-27 at 09:50
This is a reminder that the Klemming file system will be upgraded today starting at 10AM, as previously announced. This means that all systems where Klemming (/cfs/klemming) is available will have downtime starting at 10AM. This includes Lindgren, Povel, Ellen, cfs-aux-4, griffin and koaro.
2013-02-25 at 15:55
On Wednesday, the 27th of February, the file system Klemming, /cfs/klemming, will be upgraded to new hardware/software. This requires that all data in the current file system be copied to the new one. Since the copying will take days, to minimize the amount of time that files are inaccessible to users and the downtime for the computational resources, we have decided to do the copying with all systems on-line for most of that time.

At 10 AM, all systems mounting Klemming will have at least a couple of hours of downtime to switch to the new, empty file system, and we will start copying the data from the old system to the new. We will copy user by user, and as soon as a user's files show up in the new file system, that user can start queueing jobs again, which can then be started manually by PDC staff. Since we expect the number of files to be one of the most limiting factors for the speed of the copying, we will start with the users who have the fewest files relative to their amount of data.

To minimize the total time for the copying, all users are asked to go through their files in Klemming again and delete all unnecessary data before the copying starts. Users who also want to move data out of Klemming should remember to do so using the transfer nodes (currently cfs-aux-4.pdc.kth.se, more to come) rather than the login nodes.

After the copying is complete there will be another downtime of around half a day to finalize the upgrade. This downtime will be announced when we know when the copying will finish, but at least 24h in advance.
2013-02-22 at 15:33 [xxx (Zorn)]
The batch system operation has been restarted today.
2013-02-18 at 09:42 [xxx (lindgren)]
As several of you have noticed, login was blocked during parts of the weekend. There is now a workaround in place to address the issue, and login has been enabled again on Lindgren (other systems will follow).
2013-02-15 at 16:25
Unfortunately, the upgrade of Klemming has been rolled back due to issues discovered during the upgrade process. To reduce the downtime of Lindgren, we have decided to re-enable access to the current Klemming file system and reschedule the upgrade for the middle or end of next week. The exact day will be announced as soon as the issues we are seeing have been reliably resolved.
2013-02-15 at 10:04 [xxx (lindgren)]
As earlier announced, we will now bring down Lindgren during the initial phase of the /cfs/klemming upgrade.
2013-02-10 at 19:20
On Friday, the 15th of February, the file system Klemming, /cfs/klemming, will be upgraded to new hardware/software. This requires that all data in the current file system be copied to the new one. Since the copying will take days, to minimize the amount of time that files are inaccessible to users and the downtime for the computational resources, we have decided to do the copying with all systems on-line for most of that time.

At 10 AM, all systems mounting Klemming will have up to a couple of hours of downtime to switch to the new, empty file system, and we will start copying the data from the old system to the new. We will copy user by user, and as soon as a user's files show up in the new file system, that user can start running jobs again. Since we expect the number of files to be one of the most limiting factors for the speed of the copying, we will start with the users who have the fewest files relative to their amount of data.

To minimize the total time for the copying, all users are asked to go through their files in Klemming and delete all unnecessary data before the copying starts. Users who also want to move data out of Klemming should remember to do so using the transfer nodes (currently cfs-aux-4.pdc.kth.se, more to come) rather than the login nodes.

After the copying is complete there will be another downtime of a few hours to finalize the upgrade. This downtime will be announced when we know when the copying will finish, but at least 24h in advance.
2013-02-08 at 15:23 [xxx (Zorn)]
Batch operation on the cluster Zorn remains stopped during the weekend for further tests of system stability.
2013-02-07 at 11:53 [xxx (Zorn)]
Cluster Zorn: The batch processing has been restarted.
2013-02-06 at 16:39
The Klemming file system has now been fully up again for a while, and the server restart seems to have cleared what looks like a software bug in the file system. The first indications of this problem in the logs are from around 02:40 last night. Some users have reported I/O errors reading some files during these problems. If you are still experiencing problems, please let us know.
2013-02-06 at 13:31
The file system Klemming is currently recovering from a possible file system bug and subsequent restart of one of the servers.
2013-02-05 at 14:56 [xxx (Zorn)]
Cluster Zorn's Lustre file system is available again, using another file server. Nevertheless, we cannot be sure of its stability at the moment, so please back up all data you consider valuable on the file system "/cfs/zorn/nobackup/...". Batch system operation will resume in about 24 hours (late afternoon, Feb 6th) in order to avoid the immediate production of significant new amounts of data before existing data could be backed up.
2013-02-04 at 16:51 [xxx (lindgren)]
As /cfs/klemming seems to be in better shape, ordinary scheduling has been resumed. Please report unexpected behaviour.
2013-02-04 at 16:11 [xxx (Zorn)]
Cluster Zorn: a file system server has to be replaced after repeated crashes. Information will be posted here as soon as batch operation can continue.
2013-01-31 at 16:17 [xxx (Zorn)]
Cluster Zorn: the repair of the file system has been completed. Batch operation works again.
2013-01-31 at 11:54
Due to delays in the delivery of the replacement controller for the Klemming file system, the performance of Klemming is still reduced and it is more easily overloaded. The tuning done on the software side has not been able to alleviate these problems. Please refrain from running I/O-intensive applications, especially random-access workloads, until the system is fully functional again. We have now been promised delivery possibly tomorrow, and on Monday at the latest. In parallel with these efforts, we have also started to install an upgrade of the hardware behind Klemming that should give us more headroom to prevent these kinds of problems in the future. More information about this soon.
2013-01-29 at 13:44 [xxx (lindgren)]
Preventive maintenance is finished. We will open up login and gradually let batch jobs execute.
2013-01-29 at 08:41 [xxx (Zorn)]
2013-01-29, about 10:00 - 11:00 CET: no access to Zorn due to repair work on the file system.
2013-01-29 at 08:03 [xxx (lindgren)]
The planned preventive maintenance will start now.
2013-01-28 at 12:20 [xxx (Zorn)]
Cluster Zorn: the problems with the Lustre file system have come back. The investigation continues, as mentioned in the last news item. We are waiting for another spare part and have left the file system in a read-only state in order to avoid the problems worsening irreversibly.
2013-01-27 at 15:45 [xxx (lindgren)]
The file system /cfs/klemming is malfunctioning at the moment. You will experience problems accessing files, processes will hang, and you cannot work properly in interactive shell sessions if commands touch the file system.
2013-01-25 at 14:40 [xxx (Zorn)]
The Lustre file system of the cluster Zorn is available again. Nevertheless, we suspect that the problems may have been caused by unstable hardware, and we continue to investigate the file servers in cooperation with the supplier.
2013-01-25 at 09:59 [xxx (Zorn)]
Cluster Zorn: the file system is not writable at the moment. Investigation is ongoing.
2013-01-24 at 17:13 [xxx (Zorn)]
The cluster Zorn is back in normal operation.
2013-01-24 at 10:46 [xxx (Zorn)]
We have been informed that the spare parts will arrive at PDC today, or tomorrow at the latest, so we can expect to be back in operation again soon.
2013-01-23 at 22:50 [xxx (lindgren)]
Lindgren is back on-line and running jobs. As we have a planned preventive maintenance session starting this coming Tuesday, it will run with slightly reduced capacity until then. Please report any strange behaviour.
2013-01-23 at 22:07 [xxx (lindgren)]
Due to what currently appears to be a hardware failure after the power outage earlier today, Lindgren is still down. More information will be forthcoming.
2013-01-23 at 17:05 [xxx (Zorn)]
The cluster Zorn remains out of operation. A server has been damaged due to the power outage. We will decide further measures tomorrow when we have information about the delivery of spare parts.
2013-01-23 at 12:53 [xxx (Ekman)]
Ekman is back in batch operation. All previously running jobs were terminated due to the power outage. Please re-submit your requests.
2013-01-23 at 11:44
Due to a power failure, all clusters at PDC will be shut down shortly. Currently, there is no information on when power will be restored.
2013-01-22 at 16:20 [xxx (lindgren)]
Preventive maintenance, Tuesday 2013-01-29/08:00. Lindgren will go off-line. We expect to go on-line again during the afternoon.
2013-01-16 at 16:38 [xxx (Zorn)]
The batch operation is active again.
2013-01-16 at 11:18 [xxx (Zorn)]
General batch system operation has been stopped now and will resume today at 16:30.
