Christmas holidays are approaching. The PDC helpdesk will be closed from Dec 22nd, 2003. It will open again on Jan 7th, 2004.
2003-12-09 at 22:35 [xxx (strindberg)]
Old SP system: restart and verification complete. Further testing of gpfs/projects will be performed with the filesystem off-line (unavailable) later this week.
2003-12-09 at 14:49 [xxx (strindberg)]
Old SP system: restarting gpfs on all nodes and possibly also rebooting the login node.
2003-12-08 at 07:35
Relocating: the telephone routing might experience hiccups during the relocation of our offices.
2003-12-08 at 07:29 [xxx (strindberg)]
/gpfs/projects on the old SP: a disk is reporting errors. As the machine is to be retired, files are replicated by default, and there are backups, we will take no action to repair /gpfs/projects. If you have changed the replication factor, your files might contain unreadable sections (I/O error).
2003-11-29 at 15:54
The fileserver should now be back on-line.
2003-11-28 at 18:34
One AFS server has crashed; unknown reason. Salvage will probably take a while.
2003-11-27 at 11:49
All production systems: node allocation is paused until the secondary server move is complete.
2003-11-18 at 19:42
PDC is relocating between November 2003 and February 2004. Please find further information through the 'Upcoming Events' page.
2003-11-08 at 00:36 [xxx (SBC / CBR)]
The SBC AFS file server has crashed due to faulty hardware. Further investigation will start on Saturday morning. Because of the crash, the SBC cluster queue has been temporarily halted.
2003-11-07 at 14:45 [xxx (lucidor)]
Lucidor: the login node, blumino, will be rebooted to resolve a paging problem.
2003-11-04 at 15:49
... but apparently not. Repeat until false.
2003-11-04 at 15:00
One of the AFS fileservers had to be restarted. Everything should be back to normal now.
2003-11-06 at 13:00 [xxx (HSM)]
The HSM system will be reconfigured on Thursday at 13:00. It should only be down for an hour or so (or so goes the plan).
2003-10-28 at 14:44 [xxx (HSM)]
The HSM system is back up again after a change of power supply.
2003-10-27 at 14:38
Network maintenance prior to PDC move planned for tomorrow, 2003-10-28, between 1300 and 1500. No side effects expected.
2003-10-24 at 13:50 [xxx (HSM)]
Due to a server failure, the HSM system will be unavailable at least until Monday.
2003-10-24 at 13:00 [xxx (HSM)]
Due to a disk failure, the HSM disk system is being replaced.
2003-10-19 at 12:52 [xxx (strindberg)]
The nighthawk login node crashed earlier today. It has been recovered.
2003-10-15 at 19:30 [xxx (lucidor)]
Job startup code slightly modified. Please report any unexpected behaviour to
2003-10-13 at 12:38 [xxx (strindberg)]
Nighthawk - the interactive nighthawk is temporarily unavailable.
2003-10-12 at 20:01 [xxx (lucidor)]
Lucidor: reboot of the log-in node (blumino/h06n05).
2003-10-06 at 20:50 [xxx (lucidor)]
Upgrade complete. Note that you might have to relink your code if it depends on gm (myricom) or the kernel. 'module add mpich' gives the proper default mpich/gm path. Please report any strange behaviour to!
2003-10-03 at 17:17 [xxx (lucidor)]
All nodes will be upgraded to linux kernel version 2.4.22 on Monday 2003-10-06. We will at the same time upgrade to myricom gm 2.0.6. You might have to relink your code if you are using the myrinet.
2003-09-24 at 21:36 [xxx (strindberg)]
The power2 and ppc sections of strindberg are back on-line.
2003-09-24 at 09:38 [xxx (lucidor)]
System work on Lucidor. Operator is enforcing an allocation pause.
2003-09-24 at 09:38 [xxx (strindberg)]
The pwr2 part of Strindberg is still down. Investigation is in progress.
2003-09-23 at 17:30 [xxx (strindberg)]
The Power 2 part of Strindberg is currently down. The system probably won't be accessible again until some time tomorrow.
2003-09-12 at 10:53
The SSH server will be brought down today (September 12th) for a hardware and software upgrade.
2003-08-22 at 11:22 [xxx (SBC / CBR)]
Updated Intel compilers to 7.1-31 build 20030813
2003-08-18 at 15:29
The fileserver process on alanine crashed 20 minutes ago. Salvaging is in progress, but might take considerable time.
2003-08-11 at 17:10
At 17:10 a short network outage will occur for some systems at PDC in order to upgrade the router software of pdc1-gw to a more recent level. Most services will still be available through pdc2-gw. The planned duration is under 5 minutes.
2003-08-04 at 09:47 [xxx (SBC / CBR)]
The scheduler node is down. Investigation in progress.
2003-07-30 at 10:13
The license server will be rebooted at 14.00 on Thursday 31/7. During the reboot, software licenses will be unavailable.
2003-07-30 at 10:03 [xxx (HSM)]
The HSM server will be rebooted at 14.00 on Thursday 31/7. Service downtime should only be a few minutes.
2003-07-28 at 15:57 [xxx (SBC / CBR)]
Updated Intel compilers and Intel MKL libraries
2003-07-22 at 08:59 [xxx (SBC / CBR)]
We are experiencing some scheduling problems with nodes flapping up and down. Investigation in progress.
2003-06-30 at 21:47
We regret to announce that the PDC on-call service is discontinued from 1 July. We no longer guarantee that malfunctioning systems at PDC are repaired during holidays, nights, and weekends. For further information and questions contact Per Öster, +46 8 790 6261. Please let us know about any inconvenience that this decision will cause you as a PDC user.
2003-07-09 at 15:00
At 2003-07-09 15:00 we will restart our AFS servers to upgrade the AFS server software. Queues will be stopped and you may not be able to access your home directory for a short while. Duration: 10 minutes (if successful) to 5 hours (if unsuccessful).
2003-06-27 at 10:13 [xxx (SBC / CBR)]
Allocation is paused due to the AFS server problems.
2003-06-27 at 09:34 [xxx (SBC / CBR)]
The AFS server crashed again about 10 minutes ago. We're working on it.
2003-06-26 at 11:57 [xxx (SBC / CBR)]
Schedule pause tonight because of AFS server problems. Sorry for the short notice. /Harald.
2003-06-26 at 09:49 [xxx (SBC / CBR)]
The AFS server crashed again yesterday evening. Investigation and repairs are in progress.
2003-06-25 at 13:07 [xxx (SBC / CBR)]
One AFS server is having problems. Investigation in progress.
2003-06-24 at 09:29 [xxx (strindberg)]
/gpfs/scratch on the interactive nighthawk node (nf01n05) is currently unavailable. Investigation is in progress.
2003-06-19 at 13:47
PDC Helpdesk is closed for holiday 2003-06-20 (Midsummer's eve). We reopen at 08:00, 2003-06-23.
2003-06-09 at 15:00 [xxx (strindberg)]
One node serving gpfs/bins has gone bad, and data residing on that node is not available until it is repaired.
2003-06-07 at 10:00 [xxx (strindberg)]
One login node on the old SP system (strindberg) had its resources overused; jobs connected to the node were not able to start until it was restarted.
2003-06-06 at 17:03 [xxx (strindberg)]
The broken node has been replaced and all files should now be available again.
2003-06-06 at 16:00 [xxx (strindberg)]
One node serving gpfs/projects and gpfs/scratch has gone bad, and data on that node is currently not available.
2003-05-30 at 10:58
PDC Helpdesk is closed for holiday from 11:00, 2003-05-30. We reopen at 08:00, 2003-06-02.
2003-05-15 at 18:00
Due to electrical rework starting 2003-05-21 at 2100, we will as a precaution put several production systems on hold. Some filesystems, i.e., gpfs, will also be unmounted during the rework in order to extend UPS battery lifetime. The rework itself is supposed to take 12 minutes; resuming all operations will take longer.
2003-05-13 at 13:52
PDC's main mailserver is currently down.
2003-04-30 at 12:00
PDC Helpdesk is closed for May First from 12:00, 2003-04-30. We reopen at 08:00, 2003-05-05.
2003-04-27 at 15:56 [xxx (HSM)]
Due to a hardware failure, the HSM system won't be able to fetch files that reside on tape. Files that are already on disc will still be accessible and new files can be added as long as there is free space on the discs.
2003-04-24 at 14:00 [xxx (HSM)]
The HSM server will receive an OS upgrade and will be down for two hours.
2003-04-16 at 16:51 [xxx (strindberg)]
Maintenance/rearrangement of the old SP will be performed over the Easter holidays. The availability of resources within it will vary.
2003-04-17 at 12:00
PDC helpdesk is closed for Easter Holidays from 12:00, 2003-04-17. We reopen at 08:00, 2003-04-22.
2003-04-14 at 12:54
boye has a power supply failure -> VRcube down. Service requested from SGI.
2002-04-10 at 10:23 [xxx (SBC / CBR)]
The AFS problem is NOW. Previous event has incorrect date-tag.
2002-03-26 at 09:53 [xxx (SBC / CBR)]
One SBC AFS server is confused, causing some volumes to be unavailable. Investigation in progress.
2003-04-08 at 14:24
Informational: Users with home at /afs/ The AFS servers at Nada will be upgraded on 9 April starting at 18:00. Other nada-services will also be affected. Pure PDC users should not be affected.
2003-04-04 at 13:55 [xxx (strindberg)]
Nighthawk: the default IBM C and Fortran compilers have been changed to vac 6.0 and xlf 8.1.
2003-03-31 at 13:00 [xxx (strindberg)]
Nighthawk: the node serving parts of /gpfs/scratch is back online. The filesystem should operate normally.
2003-03-30 at 15:55 [xxx (strindberg)]
Nighthawk: one node serving /gpfs/scratch is signaling a power supply fault. Reduced capacity/availability of nighthawk:/gpfs/scratch.
2003-03-26 at 17:27 [xxx (strindberg)]
CFD program Fluent 6.1.18 installed
2003-03-24 at 16:27 [xxx (SBC / CBR)]
At 17:00 on 2003-04-03 the SBC-cluster login node will be taken down for physical relocation. The downtime is estimated at 30 minutes. Please use during the move.
2003-03-24 at 15:25 [xxx (linux lab)]
NAGWare Fortran Tools installed
2003-03-21 at 17:15 [xxx (strindberg)]
Nighthawk: the interactive node nf01n05 has been put into service.
2003-03-21 at 16:52 [xxx (strindberg)]
Nighthawk: the same node dumped again. Note: the affected filesystem should read /gpfs/scratch!
2003-03-21 at 15:16
Fileserver carp restarted due to excessive load.
2003-03-20 at 23:08 [xxx (strindberg)]
2003-03-20 at 12:30 [Strindberg] Nighthawk: one node serving /gpfs/projects dumped once again, causing temporary unavailability of /gpfs/projects. Once again resumed. Investigating more thoroughly.
2003-03-20 at 12:30 [xxx (strindberg)]
Nighthawk: one node serving /gpfs/projects dumped, causing temporary unavailability of /gpfs/projects. Now resumed.
2003-02-21 at 18:30 [xxx (SBC / CBR)]
AFS: overcame problems related to testing of the new SBC AFS servers. Scheduling resumed.
2003-02-15 at 13:45 [xxx (strindberg)]
Old SP: there are problems with the HA subsystem. Node allocation is paused until the problem is resolved.
2003-02-10 at 17:15 [xxx (selma)]
The disk holding the /scratch partition has been scratched for good - a controller card broke down and will not be repaired. NQS will be stopped until the system is reconfigured to run with another (smaller) /scratch.
2003-02-10 at 12:30 [xxx (strindberg)]
Nighthawk: one frame (with K-nodes) has power supply problems.
2003-02-04 at 09:34 [xxx (linux lab)]
Allocation paused. We will do some small network adjustments in the internal (ethernet) network of the cluster during the day.
2003-02-04 at 00:06 [xxx (strindberg)]
Strindberg (old SP system): as there are aftershock problems with parts of the hardware after the power outage, we will insert several blanks in the node allocation in the coming days. Also, please do report any problems.
2003-02-03 at 18:34 [xxx (strindberg)]
Switch instabilities on the old SP system; jobs on the old system are most certainly affected.
2003-02-03 at 15:02 [xxx (strindberg)]
Scheduling resumed on both SP systems. Please report any bogusities to
2003-02-03 at 12:09
AFS recovery completed. Still holding queues because we have not verified the batch functionality yet. Rumours at KTH say that a faulty transformer on the KTH grounds was to blame for the outage.
2003-02-05 at 00:30
Major power outage at KTH. Will take some time. Working on getting servers up / haba.
2003-01-31 at 01:24 [xxx (SBC / CBR)]
A non-responding fileserver process on one afs-server has been restarted.
2003-01-28 at 10:42
We're having routing problems which cause a lot of dropped traffic into and out of PDC at present. We are working on fixing the problem.
2003-01-24 at 15:25 [xxx (strindberg)]
Kerberos 5/Heimdal upgrade on the system. If you get the message "kauth: unparsable time: -1" when acquiring long-lasting tickets on the SP, use the command
kauth -l 1y
instead of -l -1.
2003-01-24 at 15:25 [xxx (strindberg)]
MASS libraries (Mathematical Acceleration Subsystem) updated to version 3.2. This includes both scalar and vector versions.
2003-01-15 at 13:19 [xxx (strindberg)]
/gpfs/projects is getting full. Please remove unneeded files. Note that the nighthawk system is not affected.
2003-01-09 at 13:19 [xxx (strindberg)]
Strindberg/Nighthawk: nf01n05 (shared/interactive) ran out of paging space and user processes were terminated.
