Friday, December 28, 2007
Stanford has told us we must take vacation during the holidays as a cost-saving measure, and yet we must release software. So here we are! We get to gossip about the holidays.
Our Sakai / CourseWork upgrades are, on the whole, pretty easy. More time is taken giving the new deployment a live shakedown. Perhaps Mary Mak can manage to get automated testing in place - it would be great to have robots crawl the site after deployment.
The procedure was that QA did the final build for release, and Julian Morley dropped it onto a private preprod box. There he reviewed any property changes. If a table had changed, Sam would have cut a clone of the production schema beforehand. Julian would apply the Operations lens to the deployment and look for issues.
On release day QA would run any of the tested SQL necessary to prepare / update the databases, and then Operations would manage the servers.
Ops takes the Big5 Load Balancer thingy and puts up the standard "out of service" link. This points to some static HTML in the Stanford www server pool. Then one machine in the tomcat pool is chosen as a candidate. The tomcat from the ops-private preprod box is dumped onto the candidate machine and started up. The non-load-balanced machine name is used for smoke testing.
Cost-cutting measures have led to a loss of that level of software release process. The only real difference is that developers now do the final build & SQL tweaking. This provides the double-edged sword of having people around who can tweak the deployment target as it is slammed into production. (When QA ran the show they would rightly punt the deploy if something had a bad smell. Now it can be hacked. You decide how you want to run your institution. :) )
In this case we're rolling out a build containing a Sakai-provisional tool of our own making. It is an alternate "home page" tool intended to be used in course sites. Lydia is handling the DB conversion. We're also doing some CourseWork Classic conversion, and Julian has wrangled a blob dump. (We switched to extracted blobs long ago - summer? - but kept them in the DB. The dump now allows us to move the tables and recover the space from disk.)
When we're not doing database table changes, a lot of the ritual is disposed of. The process boils down to a tomcat drop, a restart, and a smoke test.
I'm expecting this to go out without any operational difficulties.
Wednesday, December 19, 2007
To speed up login this past Fall we've taken a few steps. The first was to constrain Realm / Role resolution to the current Term. That was nice, as the Role Resolvers didn't have to crawl across all possible Term/Section combinations for Instructors. But then we pushed it further and removed all calls to refreshUser from the login sequence. ZIP. Due to the lack of Role checking of any sort our users got in quickly, and our database was not reduced to a heap of smoldering salmon. The user's Site membership was pulled from SAKAI_SITE_USER without an issue, and when the user navigated to a site their roles were pulled. (Sakai recalculates this stuff all the time.)
At the end of the Term we found that a set of users were losing their site membership. It was those users for whom Support had performed a "Become User" operation while resolving various issues. The Become User operation was using the Term-aware changes we had added and stripping the User from all the past-term Sites for which they had been provisioned. Since the Term dates (see the jira about 'Term effective dates') had officially passed, the Sites for that Term were being stripped of those users who had Support issues warranting a Become User session.
Pretty darn confusing.
Especially as each CM Sync job, to work around other bugs in Sakai, was doing a Site Save and a toggle of the Section Administration settings to keep the Realm info up to date. These pushed the users back INTO the sites they belonged in. Depending on when the user logged in, they may or may not have had access to their previous quarter's Sites.
Here is my log from the internal Sakai confluence site:
Working steps. done in my dev instance, against my dev oracle database.
- bring up Stan2.4.x_D
- log in as self
- review test site Su07-xoxo-001-01
- review Su07 term - it's not active. but my instructor membership in the site shows the site on my pulldown (the tab area is full)
this gives me a list of folks. I am the instructor and cwrks223 is a student.
I move to another computer and log in as admin/admin.
- login as admin/admin
- test cwrks223 account via SU 'LDAP Peek' to see if account looks OK. it does.
- SU to cwrks223
At this point the Site is not listed.
I move to the first computer
- on first computer I log out of Caseyd1 session
- log in as cwrks223
- review tabs *No Su07-XOXO-002
what's wrong with this? I didn't log in as cwrks223 FIRST. doh. notice that there was no syncing going on at this point.
- on second computer I log in as caseyd1
- I review membership in Su07-XOXO-001-01. cwrks223 is missing.
- using SQL Developer I examine database tables:
- user is still in CM data
- user is not in SAKAI_SITE_USER for this site
- user is not in REALMS for this site
select * from sakai_realm
where realm_key in
(select realm_key from sakai_realm_rl_gr where user_id = 'cwrks223')
-- works for this user in my DB due to sakai_id == EID for user
This confirms that SU strips Sakai site associations for out-of-term sites.
- on the second computer I run Site Sync for Su07-XOXO
- on the first computer ( cwrks223 account ) I nav to a tab
- on the first computer my site membership returns
This confirms that Sync Site, when it refreshes the Site against the updated CM data, restores the user to the various Sakai side records.
- in SQL developer I repeat the above query
- the user is now in the top level Realm for the site
- the user is now in the proper section Realm for the site.
- on the first computer I navigate to the Membership tool. cwrks223 membership now shows the proper list.
to test "Update Participants" refreshing
- on second computer SU to cwrks223
- on second computer cwrks223 confirm that I don't see the Su07-XOXO site
- on first computer cwrks223 tab-nav to a site, and lose Su07-XOXO membership
- on second computer logout cwrks223 and login caseyd1
- on second computer caseyd1 tab-nav to Su07-XOXO and Update Participants.
- on second computer caseyd1 see cwrks223 return to membership list.
- on first computer cwrks223 tab-navs to a site, and regains Su07-XOXO membership
this confirms that an admin or instructor using Update Participants for a site restores Sakai-lost members by refreshing against unchanged CM data.
Dialing in on the problem leaves us with Become User as the starting point for investigation. Our Term-based optimizations are being used by the Sakai Kernel to remove prior-term Site memberships for those users who have been "Become Usered" by our Support team. well that sucks.
A patch is to remove the refreshUser call from Become User. The privs are actually calculated when the user navs to the Site in question. think 'lazy loading.' :)
This is the second place where we've removed the call. The first is in the login sequence. I think that Sakai resolves User membership rights just fine when the user navs to the site.
I think that the 2.1 era addition of the SAKAI_SITE_USER table allows us to drop a lot of the confusing Realm scrubbing. It just wasn't reviewed at the time.
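The lazy-loading idea can be sketched as a toy model. To be clear, these class and method names are mine, not Sakai's; this is just the shape of the patch: nothing is resolved at login, a site's roles are computed on first navigation and memoized.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of the Become User / login patch: roles are never resolved up
// front; they are computed on first navigation to a site and then cached.
// Illustrative only - these are not Sakai classes.
public class LazyRoles {
    private final Map<String, String> roleCache = new HashMap<>();
    private final Function<String, String> resolver;  // e.g. a Realm lookup

    public LazyRoles(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    // Called when the user navigates to a site - not at login, not at SU time.
    public String roleFor(String siteId) {
        return roleCache.computeIfAbsent(siteId, resolver);
    }
}
```

The point of the sketch: the expensive (and, in our case, destructive) refresh never runs for sites the user doesn't visit.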
Tuesday, December 18, 2007
One of the hard parts of working with our QA crew is getting a common model of how Sakai operates to inform their efforts. So much seems magical / inconsistent / random that they often throw up their hands in the bug reports and trail off into "what did we miss?" and "this happens now and again..."
The practice, with stable code, should be deterministic.
One source of the problem is basing QA on clones of production databases and using the production integration points. This practice comes from laziness. I can understand it, as QA is in reality the second tier of support at Stanford. (There is only 1 support person for our few thousand users. Go cost-cutting measures!) One reason QA uses live data is that they are so often pulled in to examine production issues.
As a result the development of formal test cases has been seen as a luxury.
The production of formal test cases would follow from an understanding of the Sakai models - and internalizing that is just not going to happen in such an environment. One would need a couple more QA people focusing solely on QA, and not on "where is my worksite? oh, under that pulldown?" issues.
The construction is astounding. The weekend traffic was boggling. The snow was nice; spent an afternoon tubing with Child while younger kids sledded and M1 snowshoed with Mr Cox, singlespeedjane, JoeB, and TonkaFastButt. JimmyD and Kate oscillated back and forth.
Conversation surrounded changes in our lives; it seems that a lot of things are up in the air.
Tuesday, December 11, 2007
As I didn't (still don't) have QA LDAP servers to work with, I had to hope that it worked. In retrospect I could have played with my DNS settings to make some fail to resolve...
During a scheduled outage of an LDAP server today (during finals week? I don't get it.) it seems that tearing down the entire secure connection context isn't enough to avoid the sticky-IP problem. As a result, post-Kerberos AuthN work against the LDAP pool wasn't working correctly for a good number of users - and they couldn't completely log into Sakai. We're not sure if the TTL for the DNS entry was set low enough before the outage, but assuming it was, I expected the new from-the-bottom connection to do a full DNS request and get only the good IPs.
It looks like we'll have to set a couple of JVM properties: networkaddress.cache.ttl being the first one. The default is for the JVM to cache successful DNS resolutions forever. So even if the JAAS context was built up each time, the JVM would still have been giving it the wrong machine to work with. Ugh.
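For reference, a minimal sketch of setting those properties in code. networkaddress.cache.ttl and its negative-lookup sibling are real java.security properties, but the 60/10 second values below are illustrative guesses, not our tuned numbers:

```java
import java.security.Security;

// networkaddress.cache.ttl / networkaddress.cache.negative.ttl are
// java.security properties, so they can be set in the JVM's java.security
// file or programmatically before the first InetAddress lookup. The concrete
// TTL values here are guesses for illustration.
public class DnsCacheConfig {
    public static void apply() {
        Security.setProperty("networkaddress.cache.ttl", "60");          // cache good lookups 60s
        Security.setProperty("networkaddress.cache.negative.ttl", "10"); // cache failures 10s
    }
}
```

These have to be applied before the first resolution happens, which in a Tomcat deployment effectively means the java.security file or a startup hook.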
Some Stanford teams actually roll through IP #s for LDAP connections, recovering from bad contexts by sidestepping the DNS loadbalancer altogether.
Friday, December 7, 2007
In itself this is 'fine', but it has the result of leaving all the Section Aware Tool ACLs, stuffed away in sad little XML packages hither and yon, referring to a non-existent Group. Sadness can result, as changes to the membership are not reflected in the Tools' ACLs.
So some folks who should be able to get their stuff can't, and others who should no longer have access may still get it.
I am told that when all the XML is dissolved into columns this problem "may" go away. Altho I doubt it: some tool-wide notification mechanism will be necessary. (That would be fun to join in on, perhaps in my upcoming free time when I'm no longer employed by Stanford.) And I need a 2.4.x era fix.
I'm going to try the old Switcheroo in the SectionManagerImpl - let it create the new group thingy but then rename it to the old, deleted thingy's name. evil. yes. If that works it's an easy hack we can get into our pre-break release.
Then I can look at a real solution.
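A toy model of why the rename should work (illustrative only; none of these names are Sakai API): the per-tool XML ACLs key off the old group's identifier, so a group recreated under a fresh id is orphaned until it carries the old identity again.

```java
import java.util.HashMap;
import java.util.Map;

// The tool's ACL, keyed by group id, survives the group's deletion; only a
// group carrying the old id can satisfy it again. Hypothetical names.
public class AclSwitcheroo {
    private final Map<String, String> toolAcl = new HashMap<>();

    public AclSwitcheroo() {
        // stashed away in the tool's XML long before the group was deleted
        toolAcl.put("group-old-id", "section-1 members may read");
    }

    // What a Section Aware Tool would find for a given group id (null if orphaned).
    public String aclFor(String groupId) {
        return toolAcl.get(groupId);
    }
}
```

The Switcheroo amounts to making the freshly created group answer to "group-old-id" again, so stale ACL entries resolve without touching every tool's XML.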
Wednesday, December 5, 2007
The blog posting, http://bfish.xaedalus.net/?p=239 , works fine - but expect some oddness as this is all early release stuff.
I haven't looked at the Sakai Schedule / Calendar tool, but a Lightning Provider for Sakai would be sweet.
More information on Lightning can be found at the Mozilla Wiki's Calendar area.
Tuesday, December 4, 2007
I, however am chasing Production and QA issues :P
The Cloaking Device is in place for our load testing. It's been laid over a snapshot of Production... now we have thousands of 'fake people' for our vendors and consultants to work with.
However we have a problem with our 2.4.x_D1 deployments - Lydia's is not working properly and mine is. After I fix production issues I'll be back looking at our svn:externals to see if the current QA tag is pulling something different.
The symptom is that her deployment is not showing the cloaked users. Mine does. Weird.
Tuesday, November 27, 2007
The hunt for contractors continues. spcox provides a reference for his pal Eric, M1 is going to talk to Erick, dad of Adriaan, and I have to write a note to drop in Joe-The-Neighborhood-Fireman-and-part-time-Contractor's mailbox. I have to call the drainage contractors to see where their proposal has gone.
The flood-lost room is still rather lost. I was going to tear off all the sheetrock to see what is underneath (mud? mold? moles?) but my eagerness has waned. I blame the flu.
Today we have a conference call with AppLabs. They are the firm we've identified to help us create 'real load' against our Sakai QA environment. We don't have the capability to flog the Sakai deployment with enough boxes to produce realistic levels.
The Cloaking Device is in suitable shape for this testing. I've written about it a bit at the Sakai Confluence wiki, and in the internal Stanford Academic Computing wiki (ConSUL), so I'll hold off on describing it deeply here. In short, it is a set of Groovy scripts which manage 'fake' users in our Sakai databases. The UserDirectoryProvider (UDP) I wrote for Stanford, SKrbLDAP, gets two new configuration properties which tell it to use fake First and Last names for fake Ids, yet to use the real Id to resolve LDAP bio / affiliation information. The latter is needed to drive our Sakai User Type settings.
The AppLabs testing will use these alternate Ids; up to 1500 of them. While it is true you can often stress test a system with just a few Ids, I wanted to stress test the UDP's caching. That requires real Ids. I also want to see where we run into needing an LDAP connection pool to the Stanford OpenLDAP installation.
I call these modified users "cloaked users." Their FN/LN comes from CourseWork Classic's PERSON table, just all mangled up - random selections of names are persisted into the Cloaking Device. The fake names persist to allow QA to build up test cases for re-use and automation. The fake Ids are randomly generated in a format outside of the Stanford SUnetID namespace.
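As a rough sketch of the mangling - and to be explicit, the "zz" prefix and id format below are my invention for illustration, not the Cloaking Device's actual format:

```java
import java.util.List;
import java.util.Random;

// Sketch of cloaked-user generation: fake ids drawn from a namespace real
// SUnetIDs can't occupy, fake names from random pairings of a harvested name
// list (in our case, CourseWork Classic's PERSON table). The concrete
// "zz" + digits format is illustrative only.
public class CloakedUsers {
    private static final Random RNG = new Random();

    // A fake id outside the (assumed) SUnetID namespace: "zz" + 6 digits.
    public static String fakeId() {
        return "zz" + (100000 + RNG.nextInt(900000));
    }

    // Mangle: pair a random first name with a random last name.
    public static String fakeName(List<String> firsts, List<String> lasts) {
        return firsts.get(RNG.nextInt(firsts.size())) + " "
             + lasts.get(RNG.nextInt(lasts.size()));
    }
}
```

Persisting the generated pairs (rather than regenerating per run) is what lets QA build reusable test cases on top of the fakes.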
I've gotten permission to provide the cloaked-user filled CM database tables to the Sakai Project. They could be used as a basis for load-testing CM, or for default deployments from standard check-outs. I suspect that there will be wrinkles around our EID formats, but those can be resolved.
Monday, November 26, 2007
As I hung out drinking orange juice I fired up MediaWiki @ badubadu.com, www.badubadu.com/wiki, which will save some time. Updated the home page to list this blog and the wiki... that and sshfs from Google, and publishing is easy.
The "old systems" at Hurricane Electric are using older versions of PHP and MySQL, as you may expect. MediaWiki 1.6.10 did the trick. I patched the deployment with http://www.mediawiki.org/wiki/Extension:Page_access_restriction which, after some confusion on my part, seems to be working well.
The SSL configuration on these older machines is a bit awkward; there is one SSL cert for the entire machine, and so you have to use a machine specific URL.
some email from the office, and then back to bed.
Saturday, November 24, 2007
In spite of it we got a trail walk and some hillside scrambling in! Some photowork for holiday presents and M1's new blog.
Let more folks know I won't be at the next Sakai conference. For various reasons I haven't been in a few years; mostly centering around Stanford needing someone around who can fix things.
Friday, November 23, 2007
I blog all the time at Stanford, in the Academic Computing Confluence deployment 'Consul' - it's a great way to create an ongoing worklog easily linked back to the wiki documentation. That's great, but it's a walled garden. I've been describing the work in my personal area of the Sakai Confluence deployment, but that too is off-the-web. Feh!
So back to this blog.
When I use a machine not using our home connectivity provider (Sprint) I find that I can get to the blog hosted at blogspot. However from home I'm continually directed to the classic badubadu.com site.
When I ping from home, I get Hurricane Electric - where badubadu.com is hosted. When I dig from home I get Google's blogspot servers.
weird. I would expect that once my home machines got the right path (via dig) all would be well.
more digging. I think.
Digging was pleasantly interrupted by a family bike ride in Purisima Creek OSP. Our first visit, and a good time.
Thursday, November 22, 2007
Wednesday, February 7, 2007
Tuesday, February 6, 2007
The Apple folks are behind the Sun release, Kerberos-wise.
This is throwing off all my developer metrics for these jobs. I had to write an alternative LDAP class and inject it into my UDP - this class reads users from XML and caches them. If something references a user outside of the XML, a new user is created in the memory cache and everything continues. All this is much faster than reaching out to the mighty OpenLDAP pool.
It is pretty darn ugly. No java exceptions at this point. Time to read some SF and have a nice drink of water.
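Roughly, the stand-in behaves like this (hypothetical names throughout; this is not the real SKrbLDAP code):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the LDAP pool: users seeded up front (in the real
// class, parsed from XML) are served from a cache; an unknown id gets a
// synthetic record created on the fly, so dev runs never stall on a missing
// user. Hypothetical names only.
public class StubUserSource {
    private final Map<String, String> cache = new HashMap<>();

    public StubUserSource(Map<String, String> seeded) {
        cache.putAll(seeded);
    }

    public String displayName(String id) {
        // create-on-miss: fabricate and remember a user instead of failing
        return cache.computeIfAbsent(id, k -> "Generated User " + k);
    }
}
```

The create-on-miss behavior is the ugly-but-useful part: it trades correctness for never blocking a dev run on directory data.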
I'll revamp to add my sunetid as a departmental aid / instructor in the AM and see if I can build Sakai sites as desired.
Monday, February 5, 2007
While we were away Kid and Catherine B. built a giant factory of Legos. It must have kept them going all week. We all agreed that it must be documented before it is torn down. I thought this would be simple - some stills, or mebby some DV footage.
of course time passed.
Remembering a fun day at Zeum, courtesy of Aaron and Isaac, I hunted down the stop-motion movie program they use. I think it is the same one which sucked down hours at the Maker Faire last year. Anyhoo I got it running on an old G4 laptop Mitchell had lying around, and grabbed a spare iSight, and ...
There went the weekend, aside from hacking on Sakai and a trip to Planet Granite to lift.
Kid and I agreed that this was, of course, an experimental movie. I set this expectation because heck, who knows if this is going to work at all? It worked wildly.
I think I can sneak in some math - the # frames per second and # frames per image stuff Kid is picking up well. I'm going to shoot for graphing by using the ubiquitous Crayola washable markers and marking the Lego grids we're building on - try to map up to "feet per second" in MiniFig scale.
(Kid is consuming a book he calls "the Brick-o-pedia" madly)
more fun than I expected. :)