Events:

2015-12-10 at 08:00
Tegner service window Thu 2015-12-10. Please note that Tegner will have a service window on 10/12, starting at 08:00 with a planned end at 17:00 or earlier. The most important change will be the enforcement of allocation boundaries to improve the queuing situation. Please don't hesitate to contact support@pdc.kth.se if you have any questions.
2015-11-13 at 12:25 [beskow]
In conjunction with ordinary hardware maintenance, one of the batch front-ends stopped responding and needs to be restarted. All jobs run from that batch front-end will be killed.
2015-10-14 at 16:01 [klemming]
The problems with Klemming have hopefully been resolved for now. That said, we still don't know what caused the problems, since it doesn't seem to be the same as before, although some of the symptoms were similar. If you notice something that is still strange, let us know.
2015-10-14 at 13:02 [klemming]
Some parts of Klemming are having problems somewhat similar to previous problems. We will now initiate some recovery procedures that will cause the filesystem to be a bit more shaky for a while.
2015-09-30 at 21:30 [klemming]
The Klemming work is finished and jobs are about to be released again.
2015-09-29 at 15:15
The updates and reconfiguration of Klemming are taking more time than expected, and Klemming probably won't be available until tomorrow, hopefully before lunch. Because of this, Tegner and Beskow won't be back until then either.
2015-09-17 at 12:37 [klemming]
Restart completed; operation is back to normal.
2015-09-17 at 10:12 [klemming]
We are now restarting the Klemming servers, which will make the system even more shaky for a while.
2015-09-16 at 21:11 [klemming]
The Klemming file system is experiencing problems similar to those of last week. We will pause job starts on beskow for the time being and initiate recovery/restart of the file system tomorrow morning.
2015-09-16 at 16:29 [klemming]
During Monday and Tuesday, September 28th-29th, the Klemming filesystem will be unavailable due to service work. We plan to do some filesystem reconfiguration, firmware updates and recabling to improve reliability and performance. The filesystem will be unavailable starting at 09:00 on Monday and is expected to be back some time Tuesday afternoon, at the latest.
2015-09-08 at 15:32 [klemming]
Because more servers started to show problems, we decided to preventively restart all the remaining servers as well. Klemming should now be back to normal. Please report anything out of the ordinary.
2015-09-08 at 10:19 [klemming]
Parts of Klemming are having problems similar to yesterday's. We will now restart those servers as well, which will temporarily affect that part of Klemming. Hopefully this will clear out the last remaining problems.
2015-09-07 at 13:59 [klemming]
The parts of Klemming experiencing problems have been restarted. Operation should be back to normal.
2015-09-07 at 08:51 [klemming]
The Klemming file system is having some problems. Investigation in progress.
2015-09-05 at 23:25 [beskow]
A handful of compute nodes, and the login node, of beskow seem to experience random sluggish behavior when accessing klemming files. It has not been determined whether this is related to beskow alone, the paths between beskow and klemming, or klemming alone. Batch jobs still run as usual.
2015-09-02 at 09:17 [beskow]
What seems to be an automated procedure submitted batch jobs that unfortunately were designed to execute on the batch front-ends rather than on compute nodes. This caused all batch front-ends to choke and they are now being restarted.
2015-06-24 at 18:55 [beskow]
Preventive maintenance and software updates on beskow completed. The system is running jobs again.
2015-06-12 at 15:46 [beskow]
Preventive maintenance and software updates on beskow are planned to start Tuesday June 23 at 08:00. We aim to be finished and have the system back within two days.
2015-05-05 at 11:48 [beskow]
The system is running jobs again.
2015-05-05 at 10:45 [klemming]
Unfortunately the service took longer than expected, but Klemming is now available again. Please report anything out of the ordinary.
2015-04-30 at 10:24 [beskow]
/cfs/rydqvist has been repaired and beskow should work as usual.
2015-04-30 at 09:38 [beskow]
Since early this morning there have been problems with /cfs/rydqvist, the file system where applications are stored.
2015-04-28 at 14:37 [beskow]
Beskow will be shut down and get a system software update on Monday 2015-05-04, starting at 09:00. The update is expected to take all of Monday and part of the following Tuesday.
2015-04-28 at 14:29 [klemming]
The Klemming file system will be unavailable for system work during Monday the 4th of May, starting at 9AM CEST. We will update the system for better reliability and performance. We expect the file system to be back some time in the afternoon. This affects all systems mounting Klemming.
2015-04-14 at 08:10 [beskow]
The login node is behaving poorly and will be restarted.
2015-04-07 at 09:30 [milner]
The milner login node was behaving badly and has been restarted.
2015-03-26 at 11:35 [klemming]
We have a situation similar to that of two days ago. We will initiate a failover soon.
2015-03-24 at 10:40 [klemming]
/cfs/klemming had a hiccup starting around 09:50, which probably affected several jobs. We will probably force a so-called failover to get into a clean state. During the failover, file-system operations will block.
2015-03-09 at 16:59
Email should now work again. Some emails might have been lost, so if you sent an email to PDC today it is best if you re-send it to make sure we receive it.
2015-03-09 at 12:13 [beskow]
Preventive maintenance on beskow and maintenance on klemming completed. The system is running jobs again.
2015-03-09 at 11:56
The PDC mail server is having problems, so all PDC mail addresses (including support@pdc.kth.se) will quite likely bounce. For urgent issues, use the support phone number +46 (0)8 790 7800.
2015-03-06 at 14:46 [beskow]
Preventive maintenance is planned for the coming Monday, 2015-03-09 at 09:00. We will flash the BIOS on beskow to improve logging of some hardware events, and do minor hardware replacements in /cfs/klemming. We expect this to take 2 to 4 hours.
2015-03-05 at 16:53 [klemming]
All of Klemming has now been restarted. Hopefully this has killed off all the ghosts in the servers, but the root cause of the problems is still not entirely clear. Please report anything out of the ordinary.
2015-03-05 at 11:21 [klemming]
We are now doing some more invasive operations on Klemming to try to resolve the current problems. This means more noticeable hiccups for a while.
2015-03-05 at 00:11 [klemming]
Parts of Klemming are again having some problems, similar to the ones earlier today. Investigation continues but the root cause is still unknown.
2015-03-04 at 16:43 [klemming]
The immediate problems with Klemming are now resolved. One of the servers was restarted to clear broken state information that caused some files to be inaccessible. Please report anything still out of the ordinary.
2015-03-04 at 14:20 [klemming]
Parts of the Klemming file system are experiencing problems right now, causing access to files to hang on some machines. We are investigating the cause of these problems now.
2015-03-02 at 10:44 [beskow]
The file system with applications has been recovered, and jobs are allowed to start again. Many jobs relying on applications supplied by PDC have likely experienced problems since around 23:00 yesterday, 2015-03-01.
2015-03-01 at 23:44 [beskow]
No new jobs will start from now on; the file system with applications has been acting strangely over the past 2 hours.
2015-02-18 at 10:21 [beskow]
The system got a kernel update. The old klemming file system has been removed.
2015-02-16 at 12:42 [beskow]
The system will go offline on 2015-02-18 at 09:00 for preventive maintenance and permanent removal of the old klemming file system.
2015-02-13 at 17:09 [milner]
The login node has been rebooted. Please report unexpected behaviour.
2015-02-13 at 15:09 [milner]
The milner login node is behaving badly. We will let jobs running on it finish and then reboot it. This will happen during the late afternoon or early evening.
2015-02-11 at 21:34 [beskow]
Today's system work and klemming work are completed. Jobs are running and login is enabled again.
2015-02-10 at 15:00 [klemming]
More detailed info about tomorrow's service window for Beskow and the Klemming file system. At 09:00 Klemming will be unmounted on all systems, and will stay unmounted until the afternoon, when the upgraded version of Klemming will be mounted instead (at /cfs/klemming as usual). At this time, the data of some users will not be fully copied, and those users will therefore not have directories in Klemming yet. As soon as their data is copied, it will show up in the new file system and they can start running jobs again. To make sure that jobs already in the queue will still work, /cfs/stub1 will be a link into /cfs/klemming for a limited time. Users are urged to instead start using /cfs/klemming as before as soon as possible. The /cfs/stub1 link is planned to be removed at the next convenient service window on Beskow. Some large users have already been contacted and agreed that their data is copied off-line, which is greatly appreciated. If there are other users whose data won't be fully copied before Thursday noon, they will at that time be contacted directly with more information on how to run jobs until their data is fully copied.
2015-02-10 at 11:09 [beskow]
There were issues in the communication between slurm and alps. These are hopefully resolved now.
2015-02-10 at 10:38 [beskow]
Hi, we are currently having some trouble with slurm on Beskow and are investigating the issue. We will notify you when we have more information.
2015-02-07 at 10:25 [klemming]
Re-issued to reach beskow-users. In preparation for taking the upgraded Klemming into production, we are now starting to copy users' data to the new Klemming at a higher rate. This will put a higher load on Klemming, which might be noticeable for some jobs that do a lot of I/O. The data of inactive users that has already been copied will now start to disappear from Klemming. If this happens to you and you are an active user, let us know and we will restore the data for you. If all goes well, we plan to make the new Klemming available as /cfs/klemming next week. A tentative maintenance stop on Beskow to do this is planned for Wednesday, starting at 09:00. We will evaluate the progress of the increased copying rate after the weekend and then decide exactly how we will proceed with the copying process to limit the impact on production. More info about this and how it will affect users/jobs will be sent out on Monday.
2015-02-07 at 03:20 [klemming]
In preparation for taking the upgraded Klemming into production, we are now starting to copy users' data to the new Klemming at a higher rate. This will put a higher load on Klemming, which might be noticeable for some jobs that do a lot of I/O. The data of inactive users that has already been copied will now start to disappear from Klemming. If this happens to you and you are an active user, let us know and we will restore the data for you. If all goes well, we plan to make the new Klemming available as /cfs/klemming next week. A tentative maintenance stop on Beskow to do this is planned for Wednesday, starting at 09:00. We will evaluate the progress of the increased copying rate after the weekend and then decide exactly how we will proceed with the copying process to limit the impact on production. More info about this and how it will affect users/jobs will be sent out on Monday.
2015-01-21 at 11:32 [xxx (lindgren)]
No jobs will execute beyond Sun Jan 25 12:00:00 2015, in preparation for the system being retired. Login closes at 12:00 on Monday, January 26th.
2015-01-14 at 12:06 [klemming]
Things have settled down, and lindgren is running jobs again.
2015-01-13 at 22:41 [klemming]
As the klemming outages (as seen from a lindgren perspective) seem to continue, no further jobs will be allowed to start on lindgren.
2015-01-13 at 18:22 [klemming]
During file-system cabling work, there was a large dropout of the klemming file system. Several jobs likely got damaged and crashed. The work is now finished and operations have resumed.
2015-01-07 at 11:54 [xxx (Zorn)]
Login to Zorn has been disabled in order to allow the investigation of a problem with the Lustre filesystem.