This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Load up npm #4153

Merged: 73 commits, Oct 26, 2016
Commits
7ed4ff4
Junk, mostly. Some salvagable
chadwhitacre Oct 22, 2016
9113adb
Stub registry request.
aandis Oct 22, 2016
82f0c8a
Don't read readme since it's not there.
aandis Oct 22, 2016
423cea8
Write insertion data to io for copy.
aandis Oct 22, 2016
2b06b03
Add foreign key constraint in packages.
aandis Oct 23, 2016
7774bc0
Add id column in emails and foreign key constraint to pacakges.
aandis Oct 23, 2016
43c5831
Create npm class and modify insert method.
aandis Oct 23, 2016
7c14257
Insert mtime if it exists else default to now.
aandis Oct 23, 2016
1a927f0
Check if package exists before inserting.
aandis Oct 23, 2016
e527007
Make sure it's a package before inserting.
aandis Oct 23, 2016
076458b
Prune dead function
chadwhitacre Oct 24, 2016
ca5f81f
Clean up a couple obvious bugs
chadwhitacre Oct 24, 2016
9d3ee68
Write a test for insert_catalog_for
chadwhitacre Oct 24, 2016
872c149
Separate cursor usage from catalog reading
chadwhitacre Oct 24, 2016
dbca36d
Remove need to mock in tests
chadwhitacre Oct 24, 2016
f2ab400
Denormalize package_manager
chadwhitacre Oct 24, 2016
f2b5578
Here's a version that uses ijson
chadwhitacre Oct 24, 2016
776dd3c
Log stats at the end, too, for final count/time
chadwhitacre Oct 24, 2016
16d52fc
Success! We have data from npm getting into pg!
chadwhitacre Oct 24, 2016
e201a3c
Can we just call it readme? Let's call it readme.
chadwhitacre Oct 24, 2016
7dfc975
I think it worked! :O
chadwhitacre Oct 24, 2016
e55c58e
Let's give credit
chadwhitacre Oct 24, 2016
e52e0e4
Test of tantalizing closeness!
chadwhitacre Oct 25, 2016
13f5f5f
Clean up handling of escape/quote characters
chadwhitacre Oct 25, 2016
7cf1d6c
Merge newer work with older
chadwhitacre Oct 25, 2016
864c4d6
How about this, Travis?
chadwhitacre Oct 25, 2016
a388158
This?
chadwhitacre Oct 25, 2016
1dffa6f
Bash, smash, ...
chadwhitacre Oct 25, 2016
639405c
Smash, bash, ...
chadwhitacre Oct 25, 2016
3163f8c
Floo
chadwhitacre Oct 25, 2016
a0b387e
AAAAHHHHHHHHHHH
chadwhitacre Oct 25, 2016
93efb64
SJHFJDKKJK
chadwhitacre Oct 26, 2016
91d5e75
Or this?
chadwhitacre Oct 26, 2016
fe0d6b8
Is this it?
chadwhitacre Oct 26, 2016
e33ac7e
Oh yeah ...
chadwhitacre Oct 26, 2016
428e5ba
Try this URL ...
chadwhitacre Oct 26, 2016
d9400af
Close! Close?
chadwhitacre Oct 26, 2016
3e1d4a1
Eeeeeeeeeeee!!!!!
chadwhitacre Oct 26, 2016
fc39b24
One of these commits ...
chadwhitacre Oct 26, 2016
1a9f245
More flailing. :(
chadwhitacre Oct 26, 2016
2ab8f7d
Floo flah
chadwhitacre Oct 26, 2016
954564b
fjdkslf
chadwhitacre Oct 26, 2016
44e41c4
fdosmjklklj
chadwhitacre Oct 26, 2016
9df2c9f
THIS IS IT I JUST KNOW ITall
chadwhitacre Oct 26, 2016
da0639b
Okay that wasn't quite it
chadwhitacre Oct 26, 2016
302562a
Fooooooooo
chadwhitacre Oct 26, 2016
b19c5b0
Fjdsklfjlds
chadwhitacre Oct 26, 2016
ba4b925
ECHO! Echo! echo! ...
chadwhitacre Oct 26, 2016
10f5746
:cry:
chadwhitacre Oct 26, 2016
343bdc7
:horse:
chadwhitacre Oct 26, 2016
6742943
Fdjskfjjjjjjjjj
chadwhitacre Oct 26, 2016
1ca2811
Is this necessary?
chadwhitacre Oct 26, 2016
fcdbe28
Cleanup
chadwhitacre Oct 26, 2016
57aebff
Work around yajl limitation
chadwhitacre Oct 26, 2016
05e0c45
Fix LD_LIBRARY_PATH
chadwhitacre Oct 26, 2016
52e5dde
Is this enough?
chadwhitacre Oct 26, 2016
3595c18
Harumph. I guess we drop back to /usr/local?
chadwhitacre Oct 26, 2016
f5c4f37
Force a rebuild of yajl
chadwhitacre Oct 26, 2016
366dc86
Cool. So does this work now?
chadwhitacre Oct 26, 2016
dde52db
Turn the rest of the tests back on
chadwhitacre Oct 26, 2016
ea143be
Remove a logging line
chadwhitacre Oct 26, 2016
283a26f
Start working up a script for Heroku
chadwhitacre Oct 26, 2016
2ec25ae
Okay! Let the Heroku bashing commence!
chadwhitacre Oct 26, 2016
de7821d
Groan. YAJL requires cmake. O.o
chadwhitacre Oct 26, 2016
d46ec97
We can only write to /app
chadwhitacre Oct 26, 2016
468e395
Gosh. Big tarball to download 24 times a day! O.o
chadwhitacre Oct 26, 2016
4077a38
Curl writes to stdout
chadwhitacre Oct 26, 2016
7f519c1
Fiddle with checksum check
chadwhitacre Oct 26, 2016
8cc5602
Fix sha256sum call
chadwhitacre Oct 26, 2016
f4f1655
We don't have to test, it just fails
chadwhitacre Oct 26, 2016
03142c0
Put libyajl where we can find it
chadwhitacre Oct 26, 2016
56ee188
Add some comments
chadwhitacre Oct 26, 2016
cc8ea2c
Switch to the real URL! Ready to deploy! :O
chadwhitacre Oct 26, 2016
13 changes: 10 additions & 3 deletions .travis.yml
@@ -6,21 +6,28 @@ branches:
- master
before_install:
- git branch -vv | grep '^*'
- pwd

# Sometimes ya just halfta ...
- test -d yajl || git clone https://github.com/lloyd/yajl.git && cd yajl && git checkout 2.1.0
- test -f Makefile || ./configure && sudo make install && cd ..

- npm install -g marky-markdown
cache:
  directories:
    - env/bin
    - env/lib/python2.7/site-packages
    - yajl
install:
- if [ "${TRAVIS_BRANCH}" = "master" -a "${TRAVIS_PULL_REQUEST}" = "false" ]; then rm -rf env; fi
- touch requirements.txt package.json
- make env
- npm install -g marky-markdown
- env/bin/pip install --upgrade ijson==2.3.0
before_script:
- echo "DATABASE_URL=dbname=gratipay" | tee -a tests/local.env local.env
- psql -U postgres -c 'CREATE DATABASE "gratipay";'
- if [ "${TRAVIS_BRANCH}" = "master" -a "${TRAVIS_PULL_REQUEST}" = "false" ]; then rm -rfv tests/py/fixtures; fi
-script: make bgrun test doc
+script: LD_LIBRARY_PATH=/usr/local/lib make bgrun test doc
notifications:
  email: false
  irc: false
sudo: false
35 changes: 35 additions & 0 deletions bin/sync-npm.sh
@@ -0,0 +1,35 @@
#!/bin/sh
# This is a script to run under the Heroku Scheduler add-on to periodically
# sync our database with the npm registry.

set -e
cd "`dirname $0`/.."

# Install dependencies.
# =====================

# cmake - required by ...
curl https://cmake.org/files/v3.6/cmake-3.6.2-Linux-x86_64.tar.gz > cmake.tgz
echo '5df4b69d9e85093ae78b1070d5cb9f824ce0bdd02528948c3f6a740e240083e5 cmake.tgz' \
| sha256sum -c /dev/stdin --status
tar zxf cmake.tgz
PATH=/app/cmake-3.6.2-Linux-x86_64/bin:$PATH

# yajl
git clone https://github.com/lloyd/yajl.git
cd yajl
git checkout 2.1.0
./configure -p /app/.heroku/python
make install
cd ..

# python
pip install ijson==2.3.0
pip install -e .


# Sync with npm.
# ==============

URL=https://registry.npmjs.com/-/all
curl $URL | sync-npm serialize /dev/stdin | sync-npm upsert /dev/stdin
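
Note on the checksum step: the script pins the cmake tarball to a known SHA-256 and lets `set -e` abort the run if `sha256sum -c` fails. For reference, the same check can be sketched in Python with only the standard library. The function names `sha256_of` and `verify` are illustrative and not part of this PR.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fp:
        # Read fixed-size chunks so a large tarball never sits in memory whole.
        for chunk in iter(lambda: fp.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Raise ValueError unless the file's digest matches `expected_hex`."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError('checksum mismatch for {}: got {}'.format(path, actual))
```

Calling `verify('cmake.tgz', '5df4b69d...')` with the full digest from the script would play the role of the `sha256sum -c` line; raising on mismatch mirrors the script's fail-fast behavior.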
Empty file.
137 changes: 137 additions & 0 deletions gratipay/package_managers/sync.py
@@ -0,0 +1,137 @@
"""Sync our database with package managers. Just npm for now.
"""
from __future__ import absolute_import, division, print_function, unicode_literals

import argparse
import csv
import sys
import time

import ijson.backends.yajl2_cffi as ijson


log = lambda *a: print(*a, file=sys.stderr)


def arrayize(seq):
    """Given a sequence of str, return a Postgres array literal str.
    """
    array = []
    for item in seq:
        assert type(item) is str
        escaped = item.replace(b'\\', b'\\\\').replace(b'"', b'\\"')
        quoted = b'"' + escaped + b'"'
        array.append(quoted)
    joined = b', '.join(array)
    return b'{' + joined + b'}'


def serialize_one(out, package):
    """Takes a package and emits a serialization suitable for COPY.
    """
    if not package or package['name'].startswith('_'):
        log('skipping', package)
        return 0

    row = ( package['package_manager']
          , package['name']
          , package['description']
          , arrayize(package['emails'])
           )

    out.writerow(row)
    return 1


def serialize(args):
    """
    """
    path = args.path
    parser = ijson.parse(open(path))
    start = time.time()
    package = None
    nprocessed = 0
    out = csv.writer(sys.stdout)

    def log_stats():
        log("processed {} packages in {:3.0f} seconds"
            .format(nprocessed, time.time() - start))

    for prefix, event, value in parser:

        if not prefix and event == b'map_key':

            # Flush the current package. We count on the first package being garbage.
            processed = serialize_one(out, package)
            nprocessed += processed
            if processed and not(nprocessed % 1000):
                log_stats()

            # Start a new package.
            package = { 'package_manager': b'npm'
                      , 'name': value
                      , 'description': b''
                      , 'emails': []
                       }

        key = lambda k: package['name'] + b'.' + k

        if event == b'string':
            assert type(value) is unicode  # Who knew? Seems to decode only for `string`.
            value = value.encode('utf8')
            if prefix == key(b'description'):
                package['description'] = value
            elif prefix in (key(b'author.email'), key(b'maintainers.item.email')):
                package['emails'].append(value)

    nprocessed += serialize_one(out, package)  # Don't forget the last one!
    log_stats()


def upsert(args):
    from gratipay import wireup
    db = wireup.db(wireup.env())
    fp = open(args.path)
    with db.get_cursor() as cursor:
        assert cursor.connection.encoding == 'UTF8'

        # http://tapoueh.org/blog/2013/03/15-batch-update.html
        cursor.run("CREATE TEMP TABLE updates (LIKE packages INCLUDING ALL) ON COMMIT DROP")
        cursor.copy_expert('COPY updates (package_manager, name, description, emails) '
                           'FROM STDIN WITH (FORMAT csv)', fp)
        cursor.run("""

            WITH updated AS (
                UPDATE packages p
                   SET package_manager = u.package_manager
                     , description = u.description
                     , emails = u.emails
                  FROM updates u
                 WHERE p.name = u.name
             RETURNING p.name
            )
            INSERT INTO packages(package_manager, name, description, emails)
                 SELECT package_manager, name, description, emails
                   FROM updates u LEFT JOIN updated USING(name)
                  WHERE updated.name IS NULL
               GROUP BY u.package_manager, u.name, u.description, u.emails

        """)


def parse_args(argv):
    p = argparse.ArgumentParser()
    p.add_argument('command', choices=['serialize', 'upsert'])
    p.add_argument('path', help="the path to the input file")
    p.add_argument( '-i', '--if_modified_since'
                  , help='a number of minutes in the past, past which we would like to see new '
                         'updates (only meaningful for `serialize`; -1 means all!)'
                  , type=int
                  , default=-1
                   )
    return p.parse_args(argv)


def main(argv=sys.argv):
    args = parse_args(argv[1:])
    globals()[args.command](args)
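
Note on what `serialize` extracts: for each top-level key of the registry document it keeps the key as the package name, the `description` string, and every `author.email` and `maintainers[].email`, skipping names that start with `_` (such as `_updated`). The sketch below reproduces that selection without streaming, using `json.loads` from the standard library in place of ijson, so it only suits small fixtures and not the full registry dump. `extract_packages` is a hypothetical name, and the code targets Python 3 rather than the Python 2 of this PR.

```python
import json


def extract_packages(raw_catalog):
    """Yield (package_manager, name, description, emails) per package.

    Non-streaming stand-in for the ijson-based `serialize` above.
    """
    catalog = json.loads(raw_catalog)
    for name, doc in catalog.items():
        # Keys such as "_updated" are registry metadata, not packages.
        if name.startswith('_'):
            continue
        fields = doc if isinstance(doc, dict) else {}

        description = fields.get('description')
        if not isinstance(description, str):
            description = ''

        # Walk the fields in document order, as a streaming parser would.
        emails = []
        for key, value in fields.items():
            if key == 'author' and isinstance(value, dict):
                people = [value]
            elif key == 'maintainers' and isinstance(value, list):
                people = value
            else:
                continue
            for person in people:
                if isinstance(person, dict) and isinstance(person.get('email'), str):
                    emails.append(person['email'])

        yield ('npm', name, description, emails)
```

The streaming version in the PR reaches the same fields by matching ijson prefixes such as `<name>.maintainers.item.email`, which keeps memory use flat regardless of catalog size.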
1 change: 1 addition & 0 deletions setup.py
@@ -9,6 +9,7 @@
, entry_points = { 'console_scripts'
: [ 'payday=gratipay.cli:payday'
, 'fake_data=gratipay.utils.fake_data:main'
, 'sync-npm=gratipay.package_managers.sync:main'
]
}
)
15 changes: 15 additions & 0 deletions sql/branch.sql
@@ -0,0 +1,15 @@
BEGIN;

CREATE TABLE packages
( id bigserial PRIMARY KEY
, package_manager text NOT NULL
, name text NOT NULL
, description text NOT NULL
, readme text NOT NULL DEFAULT ''
, readme_raw text NOT NULL DEFAULT ''
, readme_type text NOT NULL DEFAULT ''
, emails text[] NOT NULL
, UNIQUE (package_manager, name)
);

END;
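
Note on how this table is fed: `upsert` in `sync.py` copies rows into a temporary `updates` table, updates the packages that already exist, and inserts the rest. The sketch below shows that staging-table pattern with `sqlite3` from the standard library, purely so it runs without a Postgres server. It is not the PR's query: SQLite has no data-modifying CTEs, so the update and insert are two statements, the `emails` array column is left out, and matching on `(package_manager, name)` is an assumption (the PR's query matches on `name` alone).

```python
import sqlite3


def upsert(conn, rows):
    """Stage rows, update the packages that match, then insert the rest."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE updates "
                "(package_manager TEXT, name TEXT, description TEXT)")
    cur.executemany("INSERT INTO updates VALUES (?, ?, ?)", rows)

    # Step 1: refresh rows that already exist in packages.
    cur.execute("""
        UPDATE packages
           SET description = (SELECT u.description FROM updates u
                               WHERE u.package_manager = packages.package_manager
                                 AND u.name = packages.name)
         WHERE EXISTS (SELECT 1 FROM updates u
                        WHERE u.package_manager = packages.package_manager
                          AND u.name = packages.name)
    """)

    # Step 2: insert staged rows that have no counterpart yet.
    cur.execute("""
        INSERT INTO packages (package_manager, name, description)
        SELECT DISTINCT u.package_manager, u.name, u.description
          FROM updates u
         WHERE NOT EXISTS (SELECT 1 FROM packages p
                            WHERE p.package_manager = u.package_manager
                              AND p.name = u.name)
    """)

    cur.execute("DROP TABLE updates")
    conn.commit()


def make_db():
    """Create an in-memory database with a cut-down packages table."""
    conn = sqlite3.connect(':memory:')
    conn.execute("""
        CREATE TABLE packages
        ( id              INTEGER PRIMARY KEY
        , package_manager TEXT NOT NULL
        , name            TEXT NOT NULL
        , description     TEXT NOT NULL
        , UNIQUE (package_manager, name)
         )
    """)
    return conn
```

Running `upsert` twice against `make_db()` with an overlapping package name leaves one row per package, carrying the later description.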
63 changes: 63 additions & 0 deletions tests/py/test_npm_sync.py
@@ -0,0 +1,63 @@
"""Tests for syncing npm. Requires a `pip install ijson`, which requires yajl.
"""
from __future__ import absolute_import, division, print_function, unicode_literals

from subprocess import Popen, PIPE

from gratipay.testing import Harness


def load(raw):
    serialized = Popen( ('env/bin/sync-npm', 'serialize', '/dev/stdin')
                      , stdin=PIPE, stdout=PIPE
                       ).communicate(raw)[0]
    Popen( ('env/bin/sync-npm', 'upsert', '/dev/stdin')
         , stdin=PIPE, stdout=PIPE
          ).communicate(serialized)[0]


class Tests(Harness):

    def test_packages_starts_empty(self):
        assert self.db.all('select * from packages') == []

    # sn - sync-npm

    def test_sn_inserts_packages(self):
        load(br'''
        { "_updated": 1234567890
        , "testing-package":
            { "name":"testing-package"
            , "description":"A package for testing"
            , "maintainers":[{"email":"[email protected]"}]
            , "author": {"email":"[email protected]"}
            , "time":{"modified":"2015-09-12T03:03:03.135Z"}
             }
         }
        ''')

        package = self.db.one('select * from packages')
        assert package.package_manager == 'npm'
        assert package.name == 'testing-package'
        assert package.description == 'A package for testing'
        assert package.name == 'testing-package'


    def test_sn_handles_quoting(self):
        load(br'''
        { "_updated": 1234567890
        , "testi\\\"ng-pa\\\"ckage":
            { "name":"testi\\\"ng-pa\\\"ckage"
            , "description":"A package for \"testing\""
            , "maintainers":[{"email":"alice@\"example\".com"}]
            , "author": {"email":"\\\\\"bob\\\\\"@example.com"}
            , "time":{"modified":"2015-09-12T03:03:03.135Z"}
             }
         }
        ''')

        package = self.db.one('select * from packages')
        assert package.package_manager == 'npm'
        assert package.name == r'testi\"ng-pa\"ckage'
        assert package.description == 'A package for "testing"'
        assert package.emails == ['alice@"example".com', r'\\"bob\\"@example.com']
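
Note on the quoting test: the email values pass through two layers of escaping before they reach Postgres. `arrayize` first builds an array literal (backslash-escaping `\` and `"` inside each quoted element), and `csv.writer` then quotes the whole field for `COPY ... WITH (FORMAT csv)`. The sketch below is a Python 3 port of `arrayize` working on `str` instead of Python 2 byte strings, with a hypothetical helper `to_csv_line` to show the second layer.

```python
import csv
import io


def arrayize(seq):
    """Return a Postgres array literal for a sequence of str."""
    quoted = []
    for item in seq:
        # Layer 1: inside a quoted array element, escape backslashes first,
        # then double quotes, each with a leading backslash.
        escaped = item.replace('\\', '\\\\').replace('"', '\\"')
        quoted.append('"' + escaped + '"')
    return '{' + ', '.join(quoted) + '}'


def to_csv_line(row):
    """Serialize one row the way csv.writer does by default."""
    buf = io.StringIO()
    # Layer 2: csv quotes any field containing commas or quotes and doubles
    # embedded double quotes; backslashes pass through untouched.
    csv.writer(buf).writerow(row)
    return buf.getvalue()


emails = ['alice@"example".com', r'\\"bob\\"@example.com']
literal = arrayize(emails)
line = to_csv_line(['npm', 'pkg', 'A package for "testing"', literal])

# Parsing the CSV line back recovers the array literal unchanged, which is
# what COPY hands to the text[] input parser.
assert next(csv.reader(io.StringIO(line)))[3] == literal
```

Escaping backslashes before quotes matters: doing it in the other order would double the backslashes that were just added in front of each quote.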